eBPF for K8s: TC Hook vs. XDP for Network Policy Enforcement

Goh Ling Yong

The Core Dilemma: Kernel Hooking for Pod-to-Pod Policy

As organizations scale their Kubernetes clusters, the performance limitations of iptables-based network policy enforcement become a significant bottleneck. The kernel's connection tracking system (conntrack) and the linear nature of iptables chains introduce latency and CPU overhead that simply don't scale to thousands of pods per node. This has driven the adoption of eBPF, which allows for programmable, high-performance packet processing directly within the kernel.

Mature CNI projects like Cilium and Calico have demonstrated the power of eBPF, but for engineers building custom networking solutions or needing to understand the underlying mechanics, a critical design decision emerges: where in the kernel's networking stack should we attach our eBPF programs? The two primary candidates for implementing pod-to-pod NetworkPolicy are the Traffic Control (TC) subsystem and the Express Data Path (XDP).

This article assumes you are familiar with Kubernetes CNI, NetworkPolicy resources, and the basics of eBPF. We will not cover introductory concepts. Instead, we will dissect the specific architectural and performance trade-offs between a TC-based and an XDP-based approach for enforcing L3/L4 network policies between pods connected via a veth pair.

Our analysis will focus on:

  • Packet Path and Context: Where each hook lives and what data (e.g., sk_buff vs. xdp_md) is available to the BPF program.
  • Implementation Complexity: A side-by-side comparison of C BPF programs and the Go userspace controllers needed to load and manage them.
  • Performance Benchmarking: A practical methodology and code for measuring the per-packet CPU cost and throughput differences.
  • Advanced Scenarios & Edge Cases: Handling atomic policy updates, observability, and the limitations that force a choice for L7-aware policies.

Let's visualize the packet flow to frame the discussion:

    text
              User Space (e.g., Pod Application)
                       ^
                       | Socket Layer
    +------------------|-------------------------------------------------+
    |                  v                                                 |
    |  Kernel Space    TCP/IP Stack (L4/L3 Processing, conntrack, etc)   |
    |                  ^
    |                  | 
    |  +---------------|----------------+  <-- [TC Hook Point: ingress/egress]
    |  |               v                |      (Operates on sk_buff)
    |  |         Queue Discipline       |
    |  |               |                |
    |  +---------------|----------------+
    |                  v 
    |  +---------------|----------------+  <-- [XDP Hook Point]
    |  |               v                |      (Operates on xdp_md)
    |  |         Network Driver         |
    |  +---------------|----------------+
    |                  |
    +------------------|-------------------------------------------------+
                       v
                 Physical/Virtual NIC (e.g., veth, eth0)

    XDP hooks run at the earliest possible point, directly in the driver, before the kernel even allocates a sk_buff (socket buffer). TC hooks run much later, after the packet has been encapsulated in an sk_buff and passed through parts of the IP stack. This fundamental difference is the source of all subsequent trade-offs.


    Deep Dive: The Traffic Control (TC) Hook Approach

    The TC subsystem has long been the standard Linux framework for traffic shaping and queueing. The cls_bpf classifier allows us to attach an eBPF program to a qdisc (queueing discipline), which can be attached to the ingress or egress of a network interface. For Kubernetes, this means attaching to each pod's veth interface.

    Implementation Pattern

    The core of the TC approach is an eBPF program that receives the full struct __sk_buff * as its context. This gives it access to nearly all metadata associated with the packet, making it incredibly flexible.

    1. The eBPF C Program (tc_policy.c)

    This program defines a simple L4 policy: it allows or denies traffic based on a destination port stored in a BPF map. This map will be populated by our userspace controller.

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>       // IPPROTO_TCP
    #include <linux/tcp.h>
    #include <linux/pkt_cls.h>  // TC_ACT_OK / TC_ACT_SHOT
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>
    
    // BPF map to store allowed destination ports.
    // Key: destination port (u16)
    // Value: 1 if allowed, 0 if not (we only need to store allowed ports)
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u16);
        __type(value, __u8);
    } policy_map SEC(".maps");
    
    // BPF map for observability: count dropped packets.
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } drop_count_map SEC(".maps");
    
    SEC("classifier")
    int enforce_policy(struct __sk_buff *skb) {
        void *data_end = (void *)(long)skb->data_end;
        void *data = (void *)(long)skb->data;
    
        // L2 Header
        struct ethhdr *eth = data;
        if ((void*)eth + sizeof(*eth) > data_end) {
            return TC_ACT_OK; // Not a valid ethernet frame
        }
    
        // We only care about IPv4 for this example
        if (eth->h_proto != bpf_htons(ETH_P_IP)) {
            return TC_ACT_OK;
        }
    
        // L3 Header
        struct iphdr *ip = data + sizeof(*eth);
        if ((void*)ip + sizeof(*ip) > data_end) {
            return TC_ACT_OK;
        }
    
        // We only care about TCP
        if (ip->protocol != IPPROTO_TCP) {
            return TC_ACT_OK;
        }
    
        // L4 Header
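        // NOTE: this offset assumes a minimal 20-byte IP header (ihl == 5);
        // a more complete parser would advance by ip->ihl * 4 to skip IP options.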
        struct tcphdr *tcp = (void*)ip + sizeof(*ip);
        if ((void*)tcp + sizeof(*tcp) > data_end) {
            return TC_ACT_OK;
        }
    
        __u16 dport = bpf_ntohs(tcp->dest);
    
        // Check policy map
        __u8 *allowed = bpf_map_lookup_elem(&policy_map, &dport);
        if (allowed) {
            // Port is in the allowed list
            bpf_printk("TC: Allowing packet to dport %d", dport);
            return TC_ACT_OK; // TC_ACT_OK means pass the packet
        }
    
        // Default drop if not explicitly allowed
        bpf_printk("TC: Dropping packet to dport %d", dport);
    
        // Increment drop counter for observability
        __u32 key = 0;
        __u64 *count = bpf_map_lookup_elem(&drop_count_map, &key);
        if (count) {
            __sync_fetch_and_add(count, 1);
        }
    
        return TC_ACT_SHOT; // TC_ACT_SHOT means drop the packet
    }
    
    char __license[] SEC("license") = "GPL";

    2. The Go Userspace Controller (main_tc.go)

    This Go application uses bpf2go (invoked via go generate) to compile the C code and generate Go bindings. At runtime it loads the resulting eBPF objects into the kernel, creates the necessary clsact qdisc, and attaches the program to a specific network interface's ingress hook.

    go
    package main
    
    import (
    	"log"
    	"net"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/vishvananda/netlink"
    	"golang.org/x/sys/unix"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf tc_policy.c -- -I./headers
    
    func main() {
    	if len(os.Args) < 2 {
    		log.Fatalf("Usage: %s <interface-name>", os.Args[0])
    	}
    	ifaceName := os.Args[1]
    
    	// Look up the network interface by name.
    	iface, err := net.InterfaceByName(ifaceName)
    	if err != nil {
    		log.Fatalf("lookup network iface %q: %s", ifaceName, err)
    	}
    
    	// Load pre-compiled programs and maps into the kernel.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %s", err)
    	}
    	defer objs.Close()
    
    	// Add a clsact qdisc to the interface. This is a special qdisc that
    	// acts as a container for BPF classifiers on ingress and egress.
    	qdisc := &netlink.GenericQdisc{
    		QdiscAttrs: netlink.QdiscAttrs{
    			LinkIndex: iface.Index,
    			Handle:    netlink.MakeHandle(0xffff, 0),
    			Parent:    netlink.HANDLE_CLSACT,
    		},
    		QdiscType: "clsact",
    	}
    	if err := netlink.QdiscAdd(qdisc); err != nil {
    		// It's fine if it already exists.
    		if !os.IsExist(err) {
    			log.Fatalf("cannot add clsact qdisc: %v", err)
    		}
    	}
    
    	// Attach the BPF program as a cls_bpf classifier on the clsact ingress
    	// hook. We are filtering incoming traffic to the pod's veth.
    	// DirectAction lets the program's return value (TC_ACT_OK / TC_ACT_SHOT)
    	// be applied directly, without a separate TC action.
    	filter := &netlink.BpfFilter{
    		FilterAttrs: netlink.FilterAttrs{
    			LinkIndex: iface.Index,
    			Parent:    netlink.HANDLE_MIN_INGRESS,
    			Handle:    netlink.MakeHandle(0, 1),
    			Protocol:  unix.ETH_P_ALL,
    			Priority:  1,
    		},
    		Fd:           objs.EnforcePolicy.FD(),
    		Name:         "tc_policy",
    		DirectAction: true,
    	}
    	if err := netlink.FilterAdd(filter); err != nil {
    		log.Fatalf("could not attach TC program: %s", err)
    	}
    	defer netlink.FilterDel(filter)
    
    	log.Printf("Attached TC program to iface %q (index %d)", iface.Name, iface.Index)
    
    	// *** Production Pattern: Atomic Policy Updates ***
    	// We will allow traffic to port 8080 and 9090.
    	allowedPorts := []uint16{8080, 9090}
    	var one uint8 = 1
    	for _, port := range allowedPorts {
    		if err := objs.PolicyMap.Put(port, one); err != nil {
    			log.Fatalf("failed to update policy map: %v", err)
    		}
    	}
    	log.Printf("Policy updated: allowing traffic to ports %v", allowedPorts)
    
    	// Wait for a signal to exit.
    	stop := make(chan os.Signal, 1)
    	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
    	<-stop
    
    	log.Println("Received signal, exiting...")
    }

    State Management & Atomic Updates

    In a real Kubernetes CNI, NetworkPolicy objects are created, updated, and deleted continuously. The userspace controller (agent) must translate these changes into BPF map entries without disrupting existing connections or introducing race conditions.
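
    As a sketch of how that translation is often wired up, the snippet below watches NetworkPolicy objects with a client-go informer and pushes the allowed TCP ports into policy_map. It is illustrative only: watchNetworkPolicies and syncPolicy are hypothetical helpers, objs is the bpf2go-generated struct from the TC controller above, in-cluster credentials are assumed, and deletions and named ports are glossed over.

    go
    package main

    import (
    	"log"
    	"time"

    	networkingv1 "k8s.io/api/networking/v1"
    	"k8s.io/client-go/informers"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/cache"
    )

    // syncPolicy flattens a NetworkPolicy's ingress rules into policy_map entries.
    func syncPolicy(objs *bpfObjects, np *networkingv1.NetworkPolicy) {
    	for _, rule := range np.Spec.Ingress {
    		for _, p := range rule.Ports {
    			if p.Port == nil || p.Port.IntValue() == 0 {
    				continue // named or unspecified ports are out of scope here
    			}
    			port := uint16(p.Port.IntValue())
    			if err := objs.PolicyMap.Put(port, uint8(1)); err != nil {
    				log.Printf("policy map update failed: %v", err)
    			}
    		}
    	}
    }

    // watchNetworkPolicies re-syncs the map on every add/update event.
    func watchNetworkPolicies(objs *bpfObjects, stop <-chan struct{}) error {
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		return err
    	}
    	client, err := kubernetes.NewForConfig(cfg)
    	if err != nil {
    		return err
    	}
    	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    	inf := factory.Networking().V1().NetworkPolicies().Informer()
    	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
    		AddFunc:    func(obj interface{}) { syncPolicy(objs, obj.(*networkingv1.NetworkPolicy)) },
    		UpdateFunc: func(_, obj interface{}) { syncPolicy(objs, obj.(*networkingv1.NetworkPolicy)) },
    		// DeleteFunc would remove the corresponding entries; omitted for brevity.
    	})
    	factory.Start(stop)
    	cache.WaitForCacheSync(stop, inf.HasSynced)
    	return nil
    }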

    The pattern shown above (simply calling Put on the map) is sufficient for this simple hash map. For more complex policies involving thousands of rules, a common production pattern is double-buffering: two identical maps are created, and the BPF program reads from the 'active' one. When a policy update occurs, the controller populates the 'inactive' map with the complete new ruleset and then atomically switches which map the program reads from, typically by publishing the new inner map through a map-of-maps (for example BPF_MAP_TYPE_ARRAY_OF_MAPS) or by flipping an index held in a small array map. This ensures there is never a window in which a partial policy is being enforced.
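
    A minimal sketch of the swap on the userspace side, assuming (hypothetically; not part of the C shown above) that the program declares a one-entry BPF_MAP_TYPE_ARRAY_OF_MAPS named policy_outer, exposed by bpf2go as objs.PolicyOuter, and performs its lookups through policy_outer[0]:

    go
    package main

    import "github.com/cilium/ebpf"

    // buildAndSwapPolicy publishes a complete new ruleset in one atomic step.
    func buildAndSwapPolicy(objs *bpfObjects, allowedPorts []uint16) error {
    	inner, err := ebpf.NewMap(&ebpf.MapSpec{
    		Type:       ebpf.Hash,
    		KeySize:    2, // __u16 destination port
    		ValueSize:  1, // __u8 "allowed" flag
    		MaxEntries: 1024,
    	})
    	if err != nil {
    		return err
    	}
    	// Populate the new (inactive) map off to the side; the datapath keeps
    	// reading the previously published inner map during this loop.
    	for _, port := range allowedPorts {
    		if err := inner.Put(port, uint8(1)); err != nil {
    			inner.Close()
    			return err
    		}
    	}
    	// A single update of the outer map is the atomic switch-over: the next
    	// lookup through policy_outer[0] sees the complete new ruleset.
    	if err := objs.PolicyOuter.Put(uint32(0), inner); err != nil {
    		inner.Close()
    		return err
    	}
    	// The kernel keeps its own reference via the outer map, so the local
    	// file descriptor can be released.
    	return inner.Close()
    }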

    Pros & Cons of TC

    * Pros:

    * Full Context: Access to the sk_buff provides rich metadata, including socket information, L4 headers, and even packet payload (with care), making it suitable for stateful firewalls and L7-aware policy stubs.

    * Flexibility: Can modify packets in complex ways, redirect them to other interfaces, or even clone them for monitoring.

    * Maturity: The TC subsystem is a well-established part of the kernel.

    * Cons:

    * Performance Overhead: It executes later in the stack. The kernel has already spent CPU cycles allocating the sk_buff and performing initial IP-level processing before our BPF program even runs. For simple drop/allow decisions, this is wasted work.


    Deep Dive: The Express Data Path (XDP) Hook Approach

    XDP is designed for the highest possible packet processing performance. By attaching to the network driver, it can make decisions before the packet even hits the main kernel networking stack.

    Implementation Pattern

    The XDP program operates on a struct xdp_md *, a much lighter-weight context than the sk_buff. It contains pointers to the raw packet data and not much else.

    1. The eBPF C Program (xdp_policy.c)

    The structure is similar to the TC program, but the context and return values are different.

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>   // IPPROTO_TCP
    #include <linux/tcp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>
    
    // Maps are identical to the TC example
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u16);
        __type(value, __u8);
    } policy_map SEC(".maps");
    
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } drop_count_map SEC(".maps");
    
    SEC("xdp")
    int enforce_policy(struct xdp_md *ctx) {
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
    
        // Same parsing logic as before
        struct ethhdr *eth = data;
        if ((void*)eth + sizeof(*eth) > data_end) {
            return XDP_PASS;
        }
    
        if (eth->h_proto != bpf_htons(ETH_P_IP)) {
            return XDP_PASS;
        }
    
        struct iphdr *ip = data + sizeof(*eth);
        if ((void*)ip + sizeof(*ip) > data_end) {
            return XDP_PASS;
        }
    
        if (ip->protocol != IPPROTO_TCP) {
            return XDP_PASS;
        }
    
        struct tcphdr *tcp = (void*)ip + sizeof(*ip);
        if ((void*)tcp + sizeof(*tcp) > data_end) {
            return XDP_PASS;
        }
    
        __u16 dport = bpf_ntohs(tcp->dest);
    
        // Check policy map
        __u8 *allowed = bpf_map_lookup_elem(&policy_map, &dport);
        if (allowed) {
            bpf_printk("XDP: Allowing packet to dport %d", dport);
            return XDP_PASS;
        }
    
        bpf_printk("XDP: Dropping packet to dport %d", dport);
    
        // Increment drop counter
        __u32 key = 0;
        __u64 *count = bpf_map_lookup_elem(&drop_count_map, &key);
        if (count) {
            __sync_fetch_and_add(count, 1);
        }
    
        return XDP_DROP;
    }
    
    char __license[] SEC("license") = "GPL";

    2. The Go Userspace Controller (main_xdp.go)

    The controller is simpler. We don't need to manage qdiscs; we just attach the program directly to the interface.

    go
    package main
    
    import (
    	"log"
    	"net"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf xdp_policy.c -- -I./headers
    
    func main() {
    	if len(os.Args) < 2 {
    		log.Fatalf("Usage: %s <interface-name>", os.Args[0])
    	}
    	ifaceName := os.Args[1]
    
    	iface, err := net.InterfaceByName(ifaceName)
    	if err != nil {
    		log.Fatalf("lookup network iface %q: %s", ifaceName, err)
    	}
    
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %s", err)
    	}
    	defer objs.Close()
    
    	// Attach the XDP program.
    	// Leaving Flags at zero uses the kernel's default mode selection:
    	// native (driver) mode if the driver supports it, generic (SKB) mode
    	// as a fallback.
    	l, err := link.AttachXDP(link.XDPOptions{
    		Program:   objs.EnforcePolicy,
    		Interface: iface.Index,
    	})
    	if err != nil {
    		log.Fatalf("could not attach XDP program: %s", err)
    	}
    	defer l.Close()
    
    	log.Printf("Attached XDP program to iface %q (index %d)", iface.Name, iface.Index)
    
    	// Policy update logic is identical to the TC example
    	allowedPorts := []uint16{8080, 9090}
    	var one uint8 = 1
    	for _, port := range allowedPorts {
    		if err := objs.PolicyMap.Put(port, one); err != nil {
    			log.Fatalf("failed to update policy map: %v", err)
    		}
    	}
    	log.Printf("Policy updated: allowing traffic to ports %v", allowedPorts)
    
    	stop := make(chan os.Signal, 1)
    	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
    	<-stop
    
    	log.Println("Received signal, exiting...")
    }

    Edge Case: `XDP_MODE_NATIVE` vs. `XDP_MODE_SKB` (Generic XDP)

    This is the most critical detail for XDP performance in a Kubernetes context.

    * XDP_MODE_NATIVE: The BPF program runs directly in the network driver's receive path. This is the fastest mode, offering the lowest latency.

    * XDP_MODE_SKB (or Generic XDP): This is a fallback for drivers that don't natively support XDP. The kernel runs the XDP program after an sk_buff has already been created, but before the packet is passed up to the main network stack.

    The veth driver, which is almost universally used for connecting pods to the node's network namespace, only gained native XDP support in relatively recent kernels (4.19+), and even then packets arriving from the local stack have already been allocated as sk_buffs, which the driver may need to linearize and re-buffer before the program can run. In practice, attaching an XDP program to a pod's veth interface either lands you in XDP_MODE_SKB or gives you native mode without its usual benefit: the costly sk_buff allocation has already happened. This significantly erodes the performance advantage of XDP over TC; the primary benefit that remains is bypassing the rest of the TC layer and IP stack.
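
    One way to make the mode explicit rather than a silent fallback is to request it at attach time: cilium/ebpf exposes attach-mode flags on XDPOptions, and the kernel rejects the attach if a requested mode cannot be honored. A drop-in variation of the attach block from main_xdp.go:

    go
    	// Request the SKB (generic) path explicitly; use link.XDPDriverMode on a
    	// NIC whose driver supports native XDP, which fails loudly otherwise.
    	l, err := link.AttachXDP(link.XDPOptions{
    		Program:   objs.EnforcePolicy,
    		Interface: iface.Index,
    		Flags:     link.XDPGenericMode,
    	})
    	if err != nil {
    		log.Fatalf("could not attach XDP program: %s", err)
    	}
    	defer l.Close()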

    Pros & Cons of XDP

    * Pros:

    * Unmatched Performance (in Native Mode): The fastest possible packet processing for simple actions like DROP, PASS, or TX (transmit).

    * DDoS Mitigation: Ideal for dropping large volumes of unwanted traffic at the earliest point, protecting the host kernel from being overwhelmed.

    * Cons:

    * Limited Context: The xdp_md struct provides minimal information, making complex, stateful decisions difficult or impossible.

    * No sk_buff: Cannot easily interact with sockets or higher-level kernel network functions.

    * Driver Dependency: XDP_MODE_NATIVE requires explicit driver support. For Kubernetes pods, you are almost always in the less performant XDP_MODE_SKB.


    Production Scenario Analysis & Benchmarking

    Let's analyze two realistic scenarios to solidify the decision framework.

    Scenario 1: High-Throughput L4 Firewall

    * Problem: You have a cluster running high-frequency trading or data streaming applications. East-west traffic is extremely high, and policies are simple L3/L4 rules (e.g., Pod A can talk to Pod B on port 3000). Every microsecond of latency and every CPU cycle counts.

    * Analysis: Even in XDP_MODE_SKB on a veth pair, XDP will likely outperform TC. It executes earlier and bypasses more of the kernel stack. While the sk_buff is already allocated, we save the cost of traversing the qdisc and upper IP stack layers for packets that will be dropped.

    * Benchmark Strategy:

    1. Create a network namespace with a veth pair (veth-a and veth-b).

    2. In one terminal, run iperf3 -s -p 8080 in the namespace.

    3. In another, attach either the TC or XDP BPF program to veth-a, configured to allow port 8080.

    4. Run iperf3 -c <server-ip> -p 8080 -t 60 -O 5 to generate traffic.

    5. While the test runs, use mpstat -P ALL 1 or perf to measure the CPU usage on the cores handling the network softirqs.

    6. The implementation with lower CPU usage for the same iperf3 throughput is the more efficient one.

    Expected Outcome: You would expect the XDP implementation to consume slightly fewer CPU cycles per packet processed compared to the TC implementation, even in generic mode. The difference might be small (e.g., 5-15%) but can be significant at massive scale.
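
    To verify during the run that the program is actually enforcing policy, and to quantify drops without relying on bpf_printk (which should be removed before benchmarking, since it adds per-packet overhead), the controller can poll the per-CPU drop counter. A small sketch; totalDrops is a hypothetical helper that lives alongside either controller and uses its bpf2go-generated objects:

    go
    // totalDrops sums the per-CPU counters stored in drop_count_map.
    func totalDrops(objs *bpfObjects) (uint64, error) {
    	// For a BPF_MAP_TYPE_PERCPU_ARRAY, Lookup fills one value per possible CPU.
    	var perCPU []uint64
    	if err := objs.DropCountMap.Lookup(uint32(0), &perCPU); err != nil {
    		return 0, err
    	}
    	var total uint64
    	for _, v := range perCPU {
    		total += v
    	}
    	return total, nil
    }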

    Scenario 2: Zero-Trust Security with L7 Policy

    * Problem: You need to enforce that Pod A (a Prometheus scraper) can only issue a GET /metrics request to Pod B (an application), but all other HTTP methods (POST, PUT, etc.) are forbidden.

    * Analysis: XDP is completely unsuitable for this task. It operates at L2/L3/L4 and has no straightforward way to parse HTTP requests, which requires reassembling TCP streams. TC is the only viable eBPF option here.

    * Advanced Implementation Pattern: A pure BPF solution for this is complex. The production-grade pattern is a hybrid approach:

    1. A TC eBPF program is attached to the pod's veth ingress/egress.

    2. The program performs initial L4 filtering. For connections on ports that require L7 inspection (e.g., port 80 on Pod B), the BPF datapath uses helpers such as bpf_msg_redirect_hash (with a BPF_MAP_TYPE_SOCKHASH) or bpf_msg_redirect_map (with a BPF_MAP_TYPE_SOCKMAP) to splice the connection's payload to a socket owned by a userspace proxy (such as Envoy), as sketched below.

    3. The Envoy proxy, running on the node, performs the full L7 policy enforcement and then forwards the allowed traffic to its final destination.

    This pattern, used by Cilium, combines the L4 performance of eBPF with the rich L7 capabilities of mature userspace proxies, providing the best of both worlds.
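
    To give a feel for the plumbing behind step 2, here is a deliberately rough sketch of the userspace attach for the socket-redirect piece. Everything named here (sock_hash, redirect_to_proxy, and the matching fields on objs) is hypothetical and not part of the programs shown earlier; the call pattern itself is the standard sockhash verdict attach.

    go
    	// Bind the sk_msg verdict program to the sockhash. Assumes the C side has
    	// been extended with a BPF_MAP_TYPE_SOCKHASH (sock_hash) and an
    	// SEC("sk_msg") program (redirect_to_proxy), and that a separate sockops
    	// program inserts proxied connections into the map.
    	err := ebpf.RawAttachProgram(ebpf.RawAttachProgramOptions{
    		Target:  objs.SockHash.FD(),
    		Program: objs.RedirectToProxy,
    		Attach:  ebpf.AttachSkMsgVerdict,
    	})
    	if err != nil {
    		log.Fatalf("attaching sk_msg verdict program: %s", err)
    	}
    	// From here on, sendmsg() on any socket present in sock_hash is steered
    	// through redirect_to_proxy, which can splice the data to the proxy.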


    Final Verdict: A Framework for Decision

    The choice between TC and XDP for Kubernetes network policy is not about which is universally 'better', but which is right for the specific job and operational context.

    | Feature / Consideration   | Traffic Control (TC)                              | Express Data Path (XDP)                        | Winner for K8s Pods                                   |
    |---------------------------|---------------------------------------------------|------------------------------------------------|-------------------------------------------------------|
    | Performance (L3/L4 drop)  | Excellent (far better than iptables)              | Exceptional (lowest overhead)                  | XDP (slightly), even in generic mode                  |
    | Context availability      | Full sk_buff                                      | Minimal xdp_md                                 | TC for any stateful or complex logic                  |
    | L7 policy support         | Yes (via helpers and userspace proxy integration) | No                                             | TC                                                    |
    | Packet modification       | Highly flexible                                   | Limited (header rewrites possible but complex) | TC                                                    |
    | Implementation simplicity | More complex (requires qdisc setup)               | Simpler (direct interface attachment)          | XDP                                                   |
    | K8s veth mode             | N/A                                               | XDP_MODE_SKB (Generic)                         | XDP's primary advantage (native mode) is lost         |

    Decision Framework:

    1. Is your policy purely L3/L4 allow/drop?

       * Yes: If you are operating at a scale where every CPU cycle spent on packet processing is critical, consider XDP. The marginal performance gain over TC might be worthwhile.

       * No: Proceed to question 2.

    2. Do you need to inspect packet payloads, maintain connection state beyond simple tuples, or integrate with userspace proxies for L7 policy?

       * Yes: TC is your only choice. Its access to the sk_buff and its position in the stack are non-negotiable requirements for these tasks.

    For the vast majority of general-purpose Kubernetes NetworkPolicy implementations, the TC-based approach offers the best balance of performance, flexibility, and future-proofing. The performance is already orders of magnitude better than iptables, and it doesn't close the door to implementing more advanced, L7-aware features in the future. XDP shines in more specialized, high-volume filtering roles, such as node-level DDoS protection on the physical NIC (where native mode is available) or for extremely latency-sensitive workloads with simple policies.
