eBPF for Granular K8s Network Policy & High-Fidelity Observability

Goh Ling Yong

Beyond IP Tables: The Kernel-Level Revolution in Cloud-Native Networking

As senior engineers managing complex Kubernetes environments, we've all encountered the limitations of the default NetworkPolicy resource. While essential for L3/L4 segmentation, it operates on a primitive understanding of traffic: IP addresses and ports. In a dynamic microservices architecture where pods are ephemeral and IP addresses are meaningless, this model quickly breaks down. We need identity-based, application-aware security.

The typical next step has been to adopt a service mesh like Istio. While powerful, this introduces significant operational complexity: sidecar injection, proxy configuration management, increased resource consumption, and an added layer of latency for every network call. What if we could achieve L7-aware policy enforcement and deep observability with near-zero overhead, directly within the Linux kernel?

This is the promise of eBPF (extended Berkeley Packet Filter). By attaching small, sandboxed programs to kernel hooks, we can inspect, filter, and even modify network packets before they traverse the traditional networking stack. This article is not an introduction to eBPF; it's a deep dive into its practical, production-grade application for security and observability in Kubernetes. We will dissect how tools like Cilium leverage eBPF for advanced L7 policies and then build our own custom eBPF-based observability tool from scratch using C and Go to solve a real-world debugging challenge.

The Kernel-Level Advantage: eBPF Data Plane vs. Sidecar Proxies

The fundamental difference between an eBPF-based data plane (like Cilium) and a sidecar-based one (like Istio) lies in the execution context. A sidecar is a user-space proxy running alongside your application container. An eBPF program is JIT-compiled and runs inside the kernel.

Data Path Comparison:

  • Sidecar Model (Istio/Envoy):

    * The packet from Pod A leaves its network namespace.

    * iptables rules redirect the packet to the Envoy sidecar's listener in Pod A.

    * Envoy (user-space) processes the packet, applies L7 policies, and performs TLS termination/origination.

    * Envoy sends the packet back into the kernel to be routed to Pod B.

    * The packet arrives at Pod B's Envoy sidecar.

    * Envoy in Pod B processes the packet and forwards it to the application container via the loopback interface.

    This involves multiple kernel-to-user-space context switches, adding measurable latency (typically milliseconds per hop) and consuming significant CPU and memory for each proxy instance.

  • eBPF Model (Cilium):

    * The packet from Pod A is generated by the application.

    * An eBPF program attached to a low-level kernel hook (e.g., the TC hook or a socket hook) intercepts the packet.

    * The eBPF program, running in kernel context, inspects the packet.

    * It makes a policy decision based on pod identity (derived from the pod's labels). Decisions that need only L3/L4 information are made entirely in-kernel; flows subject to L7 rules (HTTP/gRPC) are transparently redirected to a node-local proxy managed by Cilium and shared by all pods on the node.

    * If allowed, the packet is forwarded directly to Pod B's network interface, often bypassing large parts of iptables and the upper networking stack.

    This model is orders of magnitude more efficient. Per-packet context switches are eliminated for L3/L4 policy, and since enforcement runs in the kernel, the security boundary is stronger—policy is enforced before the packet even enters the target pod's network namespace.

    Metric           | Sidecar Proxy (Istio) | eBPF Data Plane (Cilium) | Advantage
    Added Latency    | 1-10 ms per hop       | < 1 ms per hop           | eBPF (10x-100x lower latency)
    CPU Overhead     | High (per-proxy)      | Low (shared per-node)    | eBPF (dramatically lower resource cost)
    Memory Overhead  | High (per-proxy)      | Low (shared per-node)    | eBPF (efficient memory usage via maps)
    Security Context | User-space            | Kernel-space             | eBPF (earlier, more secure enforcement)

    Deep Dive: Implementing L7-Aware Policies with Cilium

    Let's move from theory to a concrete production scenario. Imagine a multi-tenant environment with a billing-service. We need to enforce a critical security rule: only pods with the label app: payment-processor are allowed to make POST requests to the /api/v1/charge endpoint. All other pods, even if they are in the same namespace, should be blocked from accessing this specific endpoint, and the payment-processor itself should be blocked from accessing any other endpoints on the billing-service.

    A standard Kubernetes NetworkPolicy is useless here. It can restrict access to the billing-service pod on its port, but it has no visibility into the HTTP method or path.
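
    For comparison, the closest a plain NetworkPolicy can get is an L3/L4 rule like the sketch below (namespace and labels taken from the scenario above). It limits which pods can reach port 8080, but it cannot tell a POST to /api/v1/charge apart from any other request.

    yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: billing-service-l4-only
      namespace: finance
    spec:
      podSelector:
        matchLabels:
          app: billing-service
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: payment-processor
        ports:
        - protocol: TCP
          port: 8080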

    CiliumNetworkPolicy Implementation

    Cilium extends Kubernetes with a CiliumNetworkPolicy CRD that leverages eBPF to understand application-layer protocols. Here is the manifest to enforce our rule:

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "billing-service-l7-policy"
      namespace: "finance"
    spec:
      endpointSelector:
        matchLabels:
          app: billing-service
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: payment-processor
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "POST"
              path: "/api/v1/charge"

    Deconstructing the eBPF Magic

    How does this YAML translate to kernel-level enforcement?

  • Identity Mapping: The Cilium agent on each node maintains an eBPF map that correlates pod IPs with a numeric security identity derived from the pod's labels. Identities are allocated cluster-wide, so policy follows the workload rather than its ephemeral IP—far more efficient than maintaining constantly churning ipBlock-style rules in iptables.
  • eBPF Program Attachment: When this policy is applied, Cilium attaches eBPF programs to the Traffic Control (TC) hook on the veth pair of the billing-service pod. This hook allows eBPF to see all network packets entering or leaving the pod's network namespace.
  • L7 Redirection: The attached eBPF program is not a simple packet filter, but it does not parse HTTP in the kernel either. When a connection on port 8080 matches a rule that requires L7 visibility, the eBPF datapath transparently redirects the flow to a node-local Envoy proxy managed by Cilium. Unlike a sidecar, this proxy is shared by every pod on the node, and only traffic that actually needs L7 inspection pays the proxy cost.
  • Policy Decision: The proxy parses the POST /api/v1/charge HTTP/1.1 request line and headers and evaluates them against the policy: it resolves the source pod's numeric identity (via the eBPF maps above), confirms it corresponds to app: payment-processor, and matches the method and path against the allowlist. If all conditions match, the request is forwarded to billing-service; if not, it is rejected before the application ever sees it. Decisions that need only identity and L3/L4 information never leave the kernel and complete in microseconds.

    This is the power of eBPF: expressing high-level, application-aware rules whose enforcement is compiled down to highly efficient, sandboxed kernel bytecode, with any remaining L7 work handled by a shared per-node proxy instead of a sidecar in every pod.
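
    To make the identity-based lookup concrete, here is a deliberately simplified, purely illustrative sketch of such a policy map. The struct and map names are invented for this article and do not reflect Cilium's actual datapath layout; they only show the shape of the idea: key by source identity, port, and protocol, and record whether the flow is allowed and whether it needs the L7 proxy.

    c
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    // Illustrative only: a per-endpoint policy table keyed by source identity,
    // destination port and protocol.
    struct policy_key {
        __u32 src_identity; // numeric identity derived from pod labels
        __u16 dport;        // destination port, network byte order
        __u8  protocol;     // IPPROTO_TCP, IPPROTO_UDP, ...
        __u8  pad;
    };

    struct policy_entry {
        __u8 allow;        // 1 = allow the flow, 0 = drop it
        __u8 redirect_l7;  // 1 = hand the flow to the node-local L7 proxy
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, struct policy_key);
        __type(value, struct policy_entry);
    } example_policy_map SEC(".maps");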

    Building a Custom eBPF Observability Tool for DNS Latency

    While Cilium is phenomenal, the true power of eBPF is realized when you build custom tools to solve unique problems. Let's tackle a common and frustrating issue: debugging intermittent service discovery latency.

    Problem: A service is experiencing occasional high latency. You suspect slow DNS lookups from within the pod, but instrumenting every application with DNS timing metrics is impractical, and tools like tcpdump are too crude and generate massive files that are difficult to analyze at scale.

    Solution: We will build a lightweight, eBPF-based tool that traces all DNS queries (over UDP port 53) originating from a node, measures the latency between the request and response, and reports any queries exceeding a certain threshold. We will use C for the eBPF kernel program and Go with the libbpf-go library for the user-space controller.

    1. The eBPF Kernel Program (C)

    This C code will be compiled into eBPF bytecode. We'll use kprobes to attach to kernel functions responsible for sending and receiving UDP packets.

    dns_tracker.c

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_endian.h>   // bpf_ntohs()
    
    // Max DNS packet size
    #define MAX_DNS_SIZE 512

    // vmlinux.h carries kernel types but not kernel #defines, so define this here
    #ifndef TASK_COMM_LEN
    #define TASK_COMM_LEN 16
    #endif

    // Key for tracking ongoing DNS queries. The PID is deliberately not part
    // of the key: DNS responses are usually handled in softirq context, so the
    // "current" PID on the receive side is unrelated to the requester. We
    // correlate on network namespace, source port and DNS transaction ID.
    struct dns_query_key_t {
        u32 net_ns;
        u16 sport;
        u16 id; // DNS transaction ID
    };
    
    // Value storing the start timestamp and the requesting process
    struct dns_query_val_t {
        u64 ts;
        u32 pid;
        char comm[TASK_COMM_LEN];
    };
    
    // Data sent to user-space via perf buffer
    struct dns_event_t {
        u64 latency_ns;
        u32 pid;
        u32 net_ns;
        u16 id;
        char comm[TASK_COMM_LEN];
        char qname[128];
    };
    
    // Hash map to store start timestamps of DNS queries
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, struct dns_query_key_t);
        __type(value, struct dns_query_val_t);
    } ongoing_dns_queries SEC(".maps");
    
    // Perf event array to send data to user-space
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(u32));
    } events SEC(".maps");
    
    // Helper to get the network namespace inode number from a socket
    static __always_inline u32 get_net_ns(struct sock *sk) {
        if (!sk)
            return 0;
        return BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
    }
    
    // Attach to the entry of the udp_send_skb function (UDP transmit path)
    SEC("kprobe/udp_send_skb")
    int BPF_KPROBE(kprobe__udp_send_skb, struct sk_buff *skb) {
        struct udphdr udph;
        u16 id;
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u32 pid = pid_tgid >> 32;

        // bpf_skb_load_bytes() is only available to skb-based program types
        // (TC, socket filters), so in a kprobe we locate the UDP header via
        // CO-RE reads on the raw sk_buff and copy it with bpf_probe_read_kernel().
        unsigned char *head = BPF_CORE_READ(skb, head);
        u16 transport_off = BPF_CORE_READ(skb, transport_header);
        if (bpf_probe_read_kernel(&udph, sizeof(udph), head + transport_off))
            return 0;

        // We only care about DNS queries
        if (bpf_ntohs(udph.dest) != 53)
            return 0;

        // The DNS transaction ID is the first 16 bits of the DNS header,
        // which starts immediately after the UDP header.
        if (bpf_probe_read_kernel(&id, sizeof(id), head + transport_off + sizeof(udph)))
            return 0;

        struct dns_query_key_t key = {};
        key.net_ns = get_net_ns(BPF_CORE_READ(skb, sk));
        key.sport = bpf_ntohs(udph.source);
        key.id = id;

        struct dns_query_val_t val = {};
        val.ts = bpf_ktime_get_ns();
        val.pid = pid;
        bpf_get_current_comm(&val.comm, sizeof(val.comm));

        bpf_map_update_elem(&ongoing_dns_queries, &key, &val, BPF_ANY);
        return 0;
    }
    
    // Attach to the entry of the udp_queue_rcv_skb function (UDP receive path)
    SEC("kprobe/udp_queue_rcv_skb")
    int BPF_KPROBE(kprobe__udp_queue_rcv_skb, struct sock *sk, struct sk_buff *skb) {
        struct udphdr udph;
        u16 id;

        unsigned char *head = BPF_CORE_READ(skb, head);
        u16 transport_off = BPF_CORE_READ(skb, transport_header);
        if (bpf_probe_read_kernel(&udph, sizeof(udph), head + transport_off))
            return 0;

        // We only care about DNS responses
        if (bpf_ntohs(udph.source) != 53)
            return 0;

        if (bpf_probe_read_kernel(&id, sizeof(id), head + transport_off + sizeof(udph)))
            return 0;

        // Responses are typically handled in softirq context, so the current PID
        // is unrelated to the requester. We match on namespace, destination port
        // (the requester's source port) and DNS transaction ID instead.
        struct dns_query_key_t key = {};
        key.net_ns = get_net_ns(sk);
        key.sport = bpf_ntohs(udph.dest);
        key.id = id;

        struct dns_query_val_t *val_ptr = bpf_map_lookup_elem(&ongoing_dns_queries, &key);
        if (!val_ptr)
            return 0;

        u64 latency = bpf_ktime_get_ns() - val_ptr->ts;

        struct dns_event_t event = {};
        event.latency_ns = latency;
        event.pid = val_ptr->pid; // PID recorded when the query was sent
        event.net_ns = key.net_ns;
        event.id = key.id;
        __builtin_memcpy(event.comm, val_ptr->comm, sizeof(event.comm));

        // The query has been answered, so drop the tracking entry either way.
        bpf_map_delete_elem(&ongoing_dns_queries, &key);

        // Only report slow queries (e.g., > 50ms)
        if (latency < 50000000)
            return 0;

        // In a real tool, we would also parse the qname from the packet data
        // here (see the sketch after this listing).
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));

        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";
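
    The qname parsing skipped above can be added without upsetting the verifier. The helper below is a minimal sketch of one approach; the function name and parameters are ours, not part of the listing above. It assumes the question section starts 12 bytes (the fixed DNS header) after the UDP header and copies the raw, length-prefixed name for decoding in user space.

    c
    // Sketch: copy the label-encoded query name into the event.
    // `head` and `transport_off` are obtained exactly as in the probes above.
    static __always_inline void read_qname(struct dns_event_t *event,
                                           unsigned char *head, u16 transport_off) {
        // Question section: UDP header (8 bytes) + DNS header (12 bytes).
        unsigned char *qname = head + transport_off + sizeof(struct udphdr) + 12;

        // A single fixed-size read keeps the verifier happy. The name stays in
        // wire format (e.g. \x07example\x03com\x00) and is decoded in user space.
        // A production tool would also bound this by the actual packet length.
        bpf_probe_read_kernel(&event->qname, sizeof(event->qname), qname);
        event->qname[sizeof(event->qname) - 1] = '\0';
    }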

    Key Concepts in the C Code:

    * vmlinux.h: This header is generated by bpftool and contains all kernel type definitions. It's essential for CO-RE (Compile Once – Run Everywhere) to work, making our program portable across different kernel versions.

    * kprobe: A dynamic tracing mechanism to attach our eBPF program to the entry (kprobe) or exit (kretprobe) of almost any kernel function.

    * BPF_MAP_TYPE_HASH: A key-value store accessible from both our eBPF program and user-space. We use it to store the timestamp when a DNS query is sent.

    * BPF_MAP_TYPE_PERF_EVENT_ARRAY: A high-performance, lockless way to send data from the kernel program to our user-space application.

    * bpf_ktime_get_ns(): A helper function to get a monotonic timestamp.

    * CO-RE Helpers (BPF_CORE_READ): Safely read kernel struct members, even if the struct layout changes between kernel versions.

    2. The User-space Controller (Go)

    This Go application will load, attach, and listen for events from our eBPF program.

    main.go

    go
    package main
    
    import (
    	"bytes"
    	"encoding/binary"
    	"errors"
    	"fmt"
    	"log"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/perf"
    	"golang.org/x/sys/unix"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf dns_tracker.c -- -I./headers
    
    const TASK_COMM_LEN = 16
    
    type dnsEventT struct {
    	LatencyNs uint64
    	Pid       uint32
    	NetNs     uint32
    	Id        uint16
    	Comm      [TASK_COMM_LEN]byte
    	Qname     [128]byte
    }
    
    func main() {
    	// Handle Ctrl+C
    	stopper := make(chan os.Signal, 1)
    	signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
    	// Increase rlimit memory lock for eBPF maps
    	if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}); err != nil {
    		log.Fatalf("failed to set rlimit: %v", err)
    	}
    
    	// Load pre-compiled BPF objects
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach kprobe for sending UDP packets
    	kpSend, err := link.Kprobe("udp_send_skb", objs.KprobeUdpSendSkb, nil)
    	if err != nil {
    		log.Fatalf("attaching kprobe udp_send_skb: %v", err)
    	}
    	defer kpSend.Close()
    
    	// Attach kprobe for receiving UDP packets
    	kpRecv, err := link.Kprobe("udp_queue_rcv_skb", objs.KprobeUdpQueueRcvSkb, nil)
    	if err != nil {
    		log.Fatalf("attaching kprobe udp_queue_rcv_skb: %v", err)
    	}
    	defer kpRecv.Close()
    
    	// Open a perf event reader from the BPF map
    	rd, err := perf.NewReader(objs.Events, os.Getpagesize())
    	if err != nil {
    		log.Fatalf("creating perf event reader: %v", err)
    	}
    	defer rd.Close()
    
    	go func() {
    		<-stopper
    		log.Println("Received signal, exiting...")
    		rd.Close()
    	}()
    
    	log.Println("Waiting for events... Press Ctrl+C to exit.")
    
    	var event dnsEventT
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, perf.ErrClosed) {
    				return
    			}
    			log.Printf("reading from perf buffer: %s", err)
    			continue
    		}
    
    		if record.LostSamples > 0 {
    			log.Printf("perf buffer dropped %d samples", record.LostSamples)
    			continue
    		}
    
    		// Parse the raw data into our struct
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing perf event: %s", err)
    			continue
    		}
    
    		fmt.Printf("Slow DNS Query Detected! Latency: %.2fms, PID: %d, Comm: %s, NetNS: %d\n",
    			float64(event.LatencyNs)/1000000.0,
    			event.Pid,
    			unix.ByteSliceToString(event.Comm[:]),
    			event.NetNs,
    		)
    	}
    }
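
    For illustration, a slow lookup will surface as a single line on stdout. The values below are invented, but the format matches the Printf call above:

    Slow DNS Query Detected! Latency: 312.45ms, PID: 21437, Comm: curl, NetNS: 4026532861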

    3. Compilation and Deployment in Kubernetes

    Build Steps:

  • Generate vmlinux.h: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
  • Generate Go bindings: go generate will execute the bpf2go command, compiling dns_tracker.c and embedding it into a Go file (bpf_bpfel.go).
  • Build the Go binary: go build -o dns-tracker

    Production Deployment Pattern:

    This tool must run on every node to capture all DNS activity. The correct Kubernetes pattern is a DaemonSet.

    yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: dns-tracker
      namespace: kube-system
      labels:
        app: dns-tracker
    spec:
      selector:
        matchLabels:
          app: dns-tracker
      template:
        metadata:
          labels:
            app: dns-tracker
        spec:
          hostPID: true
          hostNetwork: true
          tolerations:
          - operator: Exists
          containers:
          - name: dns-tracker-container
            image: your-repo/dns-tracker:latest
            securityContext:
              privileged: true
            volumeMounts:
            - name: bpf-fs
              mountPath: /sys/fs/bpf
              readOnly: false
          volumes:
          - name: bpf-fs
            hostPath:
              path: /sys/fs/bpf

    Critical DaemonSet settings:

    * hostPID: true: Lets the tracker see the host's PID namespace so the PIDs reported by our kprobes can be correlated with actual processes on the node (the kprobes themselves fire system-wide regardless).

    * securityContext.privileged: true: Required to load eBPF programs and access kernel debugging features.

    * Volume Mount for /sys/fs/bpf: This is the virtual filesystem where eBPF maps and programs are pinned, allowing them to persist.

    Running this DaemonSet across your cluster provides a powerful, low-overhead tool for instantly identifying which pods and processes are suffering from slow DNS resolution, a task that is notoriously difficult with traditional tools.

    Advanced Edge Cases and Performance Considerations

    Building robust eBPF tooling requires a deep understanding of the kernel and the eBPF runtime's constraints.

    * The eBPF Verifier: Before any eBPF program is loaded, it undergoes rigorous static analysis by the kernel's verifier. The verifier ensures the program is safe to run by checking for unbounded loops, out-of-bounds memory access, and null pointer dereferences. A common mistake is iterating over a packet's data without explicit boundary checks, which the verifier will reject. For example, a loop whose upper bound the verifier cannot prove will fail verification; on kernels older than 5.3, which lack bounded-loop support, loops must be unrolled with #pragma unroll.

    * Map Contention at Scale: Our ongoing_dns_queries map is a BPF_MAP_TYPE_HASH. On a node handling tens of thousands of DNS queries per second, a single shared map can become a point of contention as multiple CPUs update it simultaneously. For high-throughput data that does not need to be correlated across CPUs—per-CPU counters and statistics, for example—a better choice is BPF_MAP_TYPE_PERCPU_HASH or BPF_MAP_TYPE_PERCPU_ARRAY: each CPU gets its own copy of the value, the eBPF program updates only its local copy without locking, and user-space aggregates all per-CPU values to get the full picture, trading a little memory for a significant performance gain (see the sketch after this list). Note the caveat for our tracker, though: a DNS response may be processed on a different CPU than the one that sent the query, so the request/response correlation map itself must stay a regular (or LRU) hash map.

    * Kernel Version Dependencies and CO-RE: Historically, a major pain point of eBPF was that programs had to be recompiled for the specific kernel version they ran on, because kernel data structures change between releases. CO-RE (Compile Once – Run Everywhere) solves this. By compiling against vmlinux.h (generated from the kernel's BTF, the BPF Type Format) and embedding relocation records in the object file, libbpf can perform runtime relocations: it reads the structure of the kernel on the target machine and adjusts memory offsets in your eBPF program to match. This is why our C code uses BPF_CORE_READ—it's a macro that emits these relocations.

    * Tail Calls for Complex Logic: eBPF programs have a stack size limit of 512 bytes and an instruction limit (1 million instructions since kernel 5.3). To implement complex logic, such as a multi-stage protocol parser, you can use bpf_tail_call. This function effectively replaces the current eBPF program with another one, similar to an exec() system call, without returning to the original program. This allows you to chain programs together to perform complex tasks that would otherwise exceed the verifier's limits.
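
    To make the per-CPU point concrete, here is a minimal sketch of a case where a per-CPU map is the right tool for this tracker: a lock-free counter of DNS queries seen. The map and helper names are ours. The eBPF side increments only its own CPU's slot; user-space (cilium/ebpf returns one value per possible CPU when reading per-CPU maps) sums the slots to get the total.

    c
    // A per-CPU counter: every CPU gets its own u64 slot, so the increment
    // below never contends with another CPU.
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, u64);
    } dns_query_count SEC(".maps");

    static __always_inline void count_query(void) {
        u32 idx = 0;
        u64 *cnt = bpf_map_lookup_elem(&dns_query_count, &idx);
        if (cnt)
            (*cnt)++; // safe without atomics: this CPU owns this copy
    }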

    Conclusion: eBPF as the Future of Cloud-Native Infrastructure

    eBPF is not just another tool; it represents a fundamental shift in how we build and manage cloud-native systems. It allows us to push observability, security, and networking logic into the most privileged and performant location: the kernel itself.

    We've seen how high-level tools like Cilium use eBPF to provide sophisticated L7 policies that outperform sidecar proxies. More importantly, we've demonstrated that senior engineers can and should leverage eBPF directly to build bespoke solutions for their unique challenges, like our DNS latency tracker.

    Understanding the principles of eBPF programming, the role of the verifier, the importance of CO-RE, and advanced concepts like map types and tail calls is becoming a critical skill. As the ecosystem matures with projects like Tetragon for security observability and Parca for continuous profiling, proficiency in eBPF will be a defining characteristic of the next generation of elite infrastructure and security engineers.
