eBPF for K8s: TC Hook vs. XDP for Network Policy Enforcement
The Core Dilemma: Kernel Hooking for Pod-to-Pod Policy
As organizations scale their Kubernetes clusters, the performance limitations of iptables-based network policy enforcement become a significant bottleneck. The kernel's connection tracking system (conntrack) and the linear nature of iptables chains introduce latency and CPU overhead that simply don't scale to thousands of pods per node. This has driven the adoption of eBPF, which allows for programmable, high-performance packet processing directly within the kernel.
Mature CNI projects like Cilium and Calico have demonstrated the power of eBPF, but for engineers building custom networking solutions or needing to understand the underlying mechanics, a critical design decision emerges: where in the kernel's networking stack should we attach our eBPF programs? The two primary candidates for implementing pod-to-pod NetworkPolicy are the Traffic Control (TC) subsystem and the Express Data Path (XDP).
This article assumes you are familiar with Kubernetes CNI, NetworkPolicy resources, and the basics of eBPF. We will not cover introductory concepts. Instead, we will dissect the specific architectural and performance trade-offs between a TC-based and an XDP-based approach for enforcing L3/L4 network policies between pods connected via a veth pair.
Our analysis will focus on where each hook sits in the kernel's packet path, which context structure (sk_buff vs. xdp_md) is available to the BPF program, and the performance and flexibility trade-offs that follow.
Let's visualize the packet flow to frame the discussion:
          User Space (e.g., Pod Application)
                   ^
                   | Socket Layer
+------------------|-------------------------------------------------+
|                  v                                                 |
|  Kernel Space    TCP/IP Stack (L4/L3 Processing, conntrack, etc)   |
|                  ^
|                  | 
|  +---------------|----------------+  <-- [TC Hook Point: ingress/egress]
|  |               v                |      (Operates on sk_buff)
|  |         Queue Discipline       |
|  |               |                |
|  +---------------|----------------+
|                  v 
|  +---------------|----------------+  <-- [XDP Hook Point]
|  |               v                |      (Operates on xdp_md)
|  |         Network Driver         |
|  +---------------|----------------+
|                  |
+------------------|-------------------------------------------------+
                   v
             Physical/Virtual NIC (e.g., veth, eth0)

XDP hooks run at the earliest possible point, directly in the driver, before the kernel even allocates an sk_buff (socket buffer). TC hooks run later: on ingress, after the packet has been wrapped in an sk_buff and carried through the core receive path; on egress, after the full IP stack has built the packet. This fundamental difference is the source of all subsequent trade-offs.
Deep Dive: The Traffic Control (TC) Hook Approach
The TC subsystem has long been the standard Linux framework for traffic shaping and queueing. The cls_bpf classifier allows us to attach an eBPF program to a qdisc (queueing discipline), which can be attached to the ingress or egress of a network interface. For Kubernetes, this means attaching to each pod's veth interface.
Implementation Pattern
The core of the TC approach is an eBPF program that receives the full struct __sk_buff * as its context. This gives it access to nearly all metadata associated with the packet, making it incredibly flexible.
1. The eBPF C Program (tc_policy.c)
This program defines a simple L4 policy: it allows or denies traffic based on a destination port stored in a BPF map. This map will be populated by our userspace controller.
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
// BPF map to store allowed destination ports.
// Key: destination port (u16)
// Value: 1 (presence of the key means the port is allowed)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u16);
    __type(value, __u8);
} policy_map SEC(".maps");
// BPF map for observability: count dropped packets.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} drop_count_map SEC(".maps");
SEC("classifier")
int enforce_policy(struct __sk_buff *skb) {
    void *data_end = (void *)(long)skb->data_end;
    void *data = (void *)(long)skb->data;
    // L2 Header
    struct ethhdr *eth = data;
    if ((void*)eth + sizeof(*eth) > data_end) {
        return TC_ACT_OK; // Not a valid ethernet frame
    }
    // We only care about IPv4 for this example
    if (eth->h_proto != bpf_htons(ETH_P_IP)) {
        return TC_ACT_OK;
    }
    // L3 Header
    struct iphdr *ip = data + sizeof(*eth);
    if ((void*)ip + sizeof(*ip) > data_end) {
        return TC_ACT_OK;
    }
    // We only care about TCP
    if (ip->protocol != IPPROTO_TCP) {
        return TC_ACT_OK;
    }
    // L4 Header
    // NOTE: for brevity this assumes a 20-byte IPv4 header (ihl == 5); a
    // production parser must account for IP options when locating the TCP header.
    struct tcphdr *tcp = (void*)ip + sizeof(*ip);
    if ((void*)tcp + sizeof(*tcp) > data_end) {
        return TC_ACT_OK;
    }
    __u16 dport = bpf_ntohs(tcp->dest);
    // Check policy map
    __u8 *allowed = bpf_map_lookup_elem(&policy_map, &dport);
    if (allowed) {
        // Port is in the allowed list. (bpf_printk is handy while debugging,
        // but far too slow to leave in a per-packet hot path in production.)
        bpf_printk("TC: Allowing packet to dport %d", dport);
        return TC_ACT_OK; // TC_ACT_OK means pass the packet
    }
    // Default drop if not explicitly allowed
    bpf_printk("TC: Dropping packet to dport %d", dport);
    // Increment drop counter for observability
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&drop_count_map, &key);
    if (count) {
        __sync_fetch_and_add(count, 1);
    }
    return TC_ACT_SHOT; // TC_ACT_SHOT means drop the packet
}
char __license[] SEC("license") = "GPL";
2. The Go Userspace Controller (main_tc.go)
This Go application loads the eBPF object generated by bpf2go into the kernel, ensures a clsact qdisc exists on the target interface, and attaches the program to that interface's TC ingress hook. Note that the TCX link API used for the attach (kernel 6.6 and newer) binds directly to the interface and does not strictly need the qdisc; clsact is what the classic cls_bpf filter path attaches through on older kernels.
package main
import (
	"fmt"
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/vishvananda/netlink"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf tc_policy.c -- -I./headers
func main() {
	if len(os.Args) < 2 {
		log.Fatalf("Usage: %s <interface-name>", os.Args[0])
	}
	ifaceName := os.Args[1]
	// Look up the network interface by name.
	iface, err := net.InterfaceByName(ifaceName)
	if err != nil {
		log.Fatalf("lookup network iface %q: %s", ifaceName, err)
	}
	// Load pre-compiled programs and maps into the kernel.
	objs := bpfObjects{}
	if err := loadBpfObjects(&objs, nil); err != nil {
		log.Fatalf("loading objects: %s", err)
	}
	defer objs.Close()
	// Add a clsact qdisc to the interface. clsact provides the ingress and
	// egress hook points that the classic cls_bpf filter path attaches
	// through. The TCX attach used below (kernel 6.6+) does not strictly
	// require it, but it is what you would attach to on older kernels.
	qdisc := &netlink.Clsact{QdiscAttrs: netlink.QdiscAttrs{
		LinkIndex: iface.Index,
		Handle:    netlink.MakeHandle(0xffff, 0),
		Parent:    netlink.HANDLE_CLSACT,
	}}
	if err := netlink.QdiscAdd(qdisc); err != nil {
		// It's fine if it already exists.
		if !os.IsExist(err) {
			log.Fatalf("cannot add clsact qdisc: %v", err)
		}
	}
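	// On kernels without TCX (pre-6.6), the equivalent attachment hangs a
	// cls_bpf filter in direct-action mode off the clsact ingress hook that
	// the qdisc above provides. A sketch (assumes golang.org/x/sys/unix is
	// imported for ETH_P_ALL):
	//
	//	filter := &netlink.BpfFilter{
	//		FilterAttrs: netlink.FilterAttrs{
	//			LinkIndex: iface.Index,
	//			Parent:    netlink.HANDLE_MIN_INGRESS,
	//			Handle:    netlink.MakeHandle(0, 1),
	//			Protocol:  unix.ETH_P_ALL,
	//		},
	//		Fd:           objs.EnforcePolicy.FD(),
	//		Name:         "enforce_policy",
	//		DirectAction: true,
	//	}
	//	if err := netlink.FilterAdd(filter); err != nil {
	//		log.Fatalf("could not add cls_bpf filter: %v", err)
	//	}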
	// Attach the BPF program to the interface's TC ingress hook via the
	// TCX link API (kernel 6.6+). We are filtering traffic arriving on
	// the pod's veth.
	l, err := link.AttachTCX(
		link.TCXOptions{
			Program:   objs.EnforcePolicy,
			Interface: iface.Index,
			Attach:    ebpf.AttachTCXIngress,
		},
	)
	if err != nil {
		log.Fatalf("could not attach TC program: %s", err)
	}
	defer l.Close()
	log.Printf("Attached TC program to iface %q (index %d)", iface.Name, iface.Index)
	// *** Production Pattern: Atomic Policy Updates ***
	// We will allow traffic to port 8080 and 9090.
	allowedPorts := []uint16{8080, 9090}
	var one uint8 = 1
	for _, port := range allowedPorts {
		if err := objs.PolicyMap.Put(port, one); err != nil {
			log.Fatalf("failed to update policy map: %v", err)
		}
	}
	log.Printf("Policy updated: allowing traffic to ports %v", allowedPorts)
	// Wait for a signal to exit.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop
	log.Println("Received signal, exiting...")
}
State Management & Atomic Updates
In a real Kubernetes CNI, NetworkPolicy objects are created, updated, and deleted continuously. The userspace controller (agent) must translate these changes into BPF map entries without disrupting existing connections or introducing race conditions.
The pattern shown above (simply calling Put on the map) is sufficient for this simple hash map. For more complex policies involving thousands of rules, a common production pattern is double-buffering: two identical maps are created, and the BPF program reads from the 'active' one. When a policy update occurs, the controller populates the 'inactive' map with the complete new ruleset, then atomically switches which map the program consults, typically through a BPF_MAP_TYPE_ARRAY_OF_MAPS (map-in-map) indirection or a global variable the program checks. This ensures there is never a window where a partial policy is being enforced; a sketch of the controller side follows.
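Below is a minimal sketch of the controller side of that swap, assuming the BPF program reads its rules through an outer BPF_MAP_TYPE_ARRAY_OF_MAPS whose slot 0 points at the live ruleset. The function and map handle names are illustrative (they do not appear in the programs above), and the snippet assumes the usual fmt and github.com/cilium/ebpf imports.

```go
// swapPolicy builds a fresh inner hash map containing the complete new
// ruleset, then publishes it by overwriting slot 0 of the outer
// BPF_MAP_TYPE_ARRAY_OF_MAPS. The BPF program sees either the old or the
// new ruleset on its next lookup, never a partially written one.
func swapPolicy(outer *ebpf.Map, newRules map[uint16]uint8) error {
	standby, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.Hash,
		KeySize:    2, // __u16 destination port
		ValueSize:  1, // __u8 "allowed" flag
		MaxEntries: 1024,
	})
	if err != nil {
		return fmt.Errorf("creating standby map: %w", err)
	}
	for port, allowed := range newRules {
		if err := standby.Put(port, allowed); err != nil {
			standby.Close()
			return fmt.Errorf("populating standby map: %w", err)
		}
	}
	if err := outer.Put(uint32(0), standby); err != nil {
		standby.Close()
		return fmt.Errorf("publishing new policy: %w", err)
	}
	// The kernel now holds a reference via the outer map, so the local
	// file descriptor can be closed.
	return standby.Close()
}
```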
Pros & Cons of TC
* Pros:
    *   Full Context: Access to the sk_buff provides rich metadata, including socket information, L4 headers, and even packet payload (with care), making it suitable for stateful firewalls and L7-aware policy stubs.
    *   Flexibility: Can modify packets in complex ways, redirect them to other interfaces, or even clone them for monitoring.
    *   Maturity: The TC subsystem is a well-established part of the kernel.
* Cons:
    *   Performance Overhead: It executes later in the stack. The kernel has already spent CPU cycles allocating the sk_buff and running the early receive path before our BPF program even runs. For simple drop/allow decisions, this is wasted work.
Deep Dive: The Express Data Path (XDP) Hook Approach
XDP is designed for the highest possible packet processing performance. By attaching to the network driver, it can make decisions before the packet even hits the main kernel networking stack.
Implementation Pattern
The XDP program operates on a struct xdp_md *, a much lighter-weight context than the sk_buff. It contains pointers to the raw packet data and not much else.
1. The eBPF C Program (xdp_policy.c)
The structure is similar to the TC program, but the context and return values are different.
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
// Maps are identical to the TC example
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u16);
    __type(value, __u8);
} policy_map SEC(".maps");
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} drop_count_map SEC(".maps");
SEC("xdp")
int enforce_policy(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    // Same parsing logic as before
    struct ethhdr *eth = data;
    if ((void*)eth + sizeof(*eth) > data_end) {
        return XDP_PASS;
    }
    if (eth->h_proto != bpf_htons(ETH_P_IP)) {
        return XDP_PASS;
    }
    struct iphdr *ip = data + sizeof(*eth);
    if ((void*)ip + sizeof(*ip) > data_end) {
        return XDP_PASS;
    }
    if (ip->protocol != IPPROTO_TCP) {
        return XDP_PASS;
    }
    struct tcphdr *tcp = (void*)ip + sizeof(*ip);
    if ((void*)tcp + sizeof(*tcp) > data_end) {
        return XDP_PASS;
    }
    __u16 dport = bpf_ntohs(tcp->dest);
    // Check policy map
    __u8 *allowed = bpf_map_lookup_elem(&policy_map, &dport);
    if (allowed) {
        bpf_printk("XDP: Allowing packet to dport %d", dport);
        return XDP_PASS;
    }
    bpf_printk("XDP: Dropping packet to dport %d", dport);
    // Increment drop counter
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&drop_count_map, &key);
    if (count) {
        __sync_fetch_and_add(count, 1);
    }
    return XDP_DROP;
}
char __license[] SEC("license") = "GPL";
2. The Go Userspace Controller (main_xdp.go)
The controller is simpler. We don't need to manage qdiscs; we just attach the program directly to the interface.
package main
import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"
	"github.com/cilium/ebpf/link"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf xdp_policy.c -- -I./headers
func main() {
	if len(os.Args) < 2 {
		log.Fatalf("Usage: %s <interface-name>", os.Args[0])
	}
	ifaceName := os.Args[1]
	iface, err := net.InterfaceByName(ifaceName)
	if err != nil {
		log.Fatalf("lookup network iface %q: %s", ifaceName, err)
	}
	objs := bpfObjects{}
	if err := loadBpfObjects(&objs, nil); err != nil {
		log.Fatalf("loading objects: %s", err)
	}
	defer objs.Close()
	// Attach the XDP program. Leaving Flags unset lets the kernel pick the
	// default mode: XDP_MODE_NATIVE if the driver supports it, otherwise
	// XDP_MODE_SKB (generic XDP).
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   objs.EnforcePolicy,
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatalf("could not attach XDP program: %s", err)
	}
	defer l.Close()
	log.Printf("Attached XDP program to iface %q (index %d)", iface.Name, iface.Index)
	// Policy update logic is identical to the TC example
	allowedPorts := []uint16{8080, 9090}
	var one uint8 = 1
	for _, port := range allowedPorts {
		if err := objs.PolicyMap.Put(port, one); err != nil {
			log.Fatalf("failed to update policy map: %v", err)
		}
	}
	log.Printf("Policy updated: allowing traffic to ports %v", allowedPorts)
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop
	log.Println("Received signal, exiting...")
}
Edge Case: `XDP_MODE_NATIVE` vs. `XDP_MODE_SKB` (Generic XDP)
This is the most critical detail for XDP performance in a Kubernetes context.
*   XDP_MODE_NATIVE: The BPF program runs directly in the network driver's receive path. This is the fastest mode, offering the lowest latency.
*   XDP_MODE_SKB (or Generic XDP): This is a fallback for drivers that don't natively support XDP. The kernel runs the XDP program *after* an sk_buff has already been created, but before the packet is passed up to the main network stack.
The veth driver, which is almost universally used for connecting pods to the node's network namespace, gains little from XDP's headline advantage. Recent kernels do include native XDP support for veth, but it primarily benefits frames redirected from another XDP program; ordinary pod traffic has already been allocated an sk_buff by the time it reaches the veth. In practice, an XDP program attached to a pod's veth therefore behaves much like XDP_MODE_SKB, which significantly erodes the performance advantage of XDP over TC: the costly sk_buff allocation has already happened. The primary benefit that remains is dropping unwanted packets before they travel any further up the stack.
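If you want to pin the mode down rather than rely on the kernel's default selection, cilium/ebpf exposes attach-mode flags. Below is a minimal variant of the attach call from main_xdp.go above; requesting driver (native) mode on an interface whose driver cannot honour it makes the attach fail outright, which is usually preferable to a silent fallback when benchmarking.

```go
	// Pin the XDP attach mode explicitly. link.XDPGenericMode forces
	// generic (SKB-mode) XDP; link.XDPDriverMode insists on native XDP
	// and fails if the driver cannot provide it.
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   objs.EnforcePolicy,
		Interface: iface.Index,
		Flags:     link.XDPGenericMode,
	})
	if err != nil {
		log.Fatalf("could not attach XDP program: %s", err)
	}
	defer l.Close()
```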
Pros & Cons of XDP
* Pros:
    *   Unmatched Performance (in Native Mode): The fastest possible packet processing for simple actions like DROP, PASS, or TX (transmit).
    *   DDoS Mitigation: Ideal for dropping large volumes of unwanted traffic at the earliest point, protecting the host kernel from being overwhelmed.
* Cons:
    *   Limited Context: The xdp_md struct provides minimal information, making complex, stateful decisions difficult or impossible.
    *   No sk_buff: Cannot easily interact with sockets or higher-level kernel network functions.
    *   Driver Dependency: XDP_MODE_NATIVE requires explicit driver support and the right traffic path. For Kubernetes pod veth interfaces, you are effectively running in the less performant XDP_MODE_SKB.
Production Scenario Analysis & Benchmarking
Let's analyze two realistic scenarios to solidify the decision framework.
Scenario 1: High-Throughput L4 Firewall
* Problem: You have a cluster running high-frequency trading or data streaming applications. East-west traffic is extremely high, and policies are simple L3/L4 rules (e.g., Pod A can talk to Pod B on port 3000). Every microsecond of latency and every CPU cycle counts.
*   Analysis: Even in XDP_MODE_SKB on a veth pair, XDP may still edge out TC. The generic XDP hook runs slightly earlier in the receive path than the TC ingress hook, so packets destined to be dropped are discarded with a little less per-packet work. With the sk_buff already allocated, however, the gap is far smaller than native XDP's headline numbers would suggest.
* Benchmark Strategy:
    1.  Create a network namespace with a veth pair (veth-a and veth-b).
    2.  In one terminal, run iperf3 -s -p 8080 in the namespace.
    3.  In another, attach either the TC or XDP BPF program to veth-a, configured to allow port 8080.
    4.  Run iperf3 -c <server-ip> -p 8080 to generate traffic.
    5.  While the test runs, use mpstat -P ALL 1 or perf to measure the CPU usage on the cores handling the network softirqs.
    6.  The implementation with lower CPU usage for the same iperf3 throughput is the more efficient one.
Expected Outcome: You would expect the XDP implementation to consume slightly fewer CPU cycles per packet processed compared to the TC implementation, even in generic mode. The difference might be small (e.g., 5-15%) but can be significant at massive scale.
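During such a run it is also worth confirming that the expected packets are actually being dropped. Both programs above count drops in drop_count_map, a BPF_MAP_TYPE_PERCPU_ARRAY, so the controller has to sum one value per possible CPU. A small helper along these lines (readDrops is an illustrative name, not part of the controllers shown earlier) can be polled while iperf3 runs:

```go
// readDrops sums the per-CPU drop counters stored in drop_count_map.
// For per-CPU maps, cilium/ebpf fills one slice element per possible CPU.
func readDrops(m *ebpf.Map) (uint64, error) {
	var perCPU []uint64
	if err := m.Lookup(uint32(0), &perCPU); err != nil {
		return 0, err
	}
	var total uint64
	for _, v := range perCPU {
		total += v
	}
	return total, nil
}
```

Calling it as readDrops(objs.DropCountMap) with the bpf2go-generated objects confirms whether traffic is being dropped by policy rather than lost elsewhere.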
Scenario 2: Zero-Trust Security with L7 Policy
*   Problem: You need to enforce that Pod A (a Prometheus scraper) can only issue a GET /metrics request to Pod B (an application), but all other HTTP methods (POST, PUT, etc.) are forbidden.
* Analysis: XDP is completely unsuitable for this task. It operates at L2/L3/L4 and has no straightforward way to parse HTTP requests, which requires reassembling TCP streams. TC is the only viable eBPF option here.
* Advanced Implementation Pattern: A pure BPF solution for this is complex. The production-grade pattern is a hybrid approach:
    1.  A TC eBPF program is attached to the pod's veth ingress/egress.
    2.  The program performs initial L4 filtering. For connections on ports that require L7 inspection (e.g., port 80 on Pod B), the BPF program uses a helper such as bpf_msg_redirect_map (with a BPF_MAP_TYPE_SOCKMAP) or bpf_msg_redirect_hash (with a BPF_MAP_TYPE_SOCKHASH) to steer the connection's payload to a userspace proxy (like Envoy).
    3.  The Envoy proxy, running on the node, performs the full L7 policy enforcement and then forwards the allowed traffic to its final destination.
This pattern, used by Cilium, combines the L4 performance of eBPF with the rich L7 capabilities of mature userspace proxies, providing the best of both worlds.
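The full plumbing for this pattern (a sock_ops program that inserts established sockets into the map, plus the proxy itself) is beyond the scope of this article, but for orientation, the userspace side of wiring an sk_msg verdict program to its map looks roughly like the sketch below. The program and map names (RedirectToProxy, ProxySockHash) are hypothetical and are not part of the examples shown earlier.

```go
	// Hypothetical sketch: attach an SEC("sk_msg") verdict program (one that
	// calls bpf_msg_redirect_hash) to the BPF_MAP_TYPE_SOCKHASH it redirects
	// through. A separate sock_ops program (not shown) is responsible for
	// adding connections that need L7 inspection to this map.
	err := link.RawAttachProgram(link.RawAttachProgramOptions{
		Target:  objs.ProxySockHash.FD(), // SOCKHASH holding proxied connections
		Program: objs.RedirectToProxy,    // sk_msg verdict program
		Attach:  ebpf.AttachSkMsgVerdict,
	})
	if err != nil {
		log.Fatalf("attaching sk_msg verdict program: %s", err)
	}
```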
Final Verdict: A Framework for Decision
The choice between TC and XDP for Kubernetes network policy is not about which is universally 'better', but which is right for the specific job and operational context.
| Feature / Consideration | Traffic Control (TC) | Express Data Path (XDP) | Winner For K8s Pods | 
|---|---|---|---|
| Performance (L3/L4 Drop) | Excellent (far better than iptables) | Exceptional (lowest overhead) | XDP (Slightly), even in generic mode | 
| Context Availability | Full sk_buff | Minimal xdp_md | TC for any stateful or complex logic | 
| L7 Policy Support | Yes (via helpers and userspace proxy integration) | No | TC | 
| Packet Modification | Highly Flexible | Limited (header rewrites are possible but complex) | TC | 
| Implementation Simplicity | More complex (requires qdisc setup) | Simpler (direct interface attachment) | XDP | 
| K8s veth Mode | N/A | XDP_MODE_SKB (Generic, in practice) | Acknowledges XDP's primary advantage (native mode) is lost |
Decision Framework:
1.  Is your policy limited to simple, stateless L3/L4 allow/drop rules?
    *   Yes: And you are operating at a scale where every CPU cycle for packet processing is critical? Consider XDP. The marginal performance gain over TC might be worthwhile.
    *   No: Proceed to question 2.
2.  Do you need stateful filtering, packet modification, or a path to L7-aware policy?
    *   Yes: TC is your only choice. Its access to the sk_buff and its position in the stack are non-negotiable requirements for these tasks.
For the vast majority of general-purpose Kubernetes NetworkPolicy implementations, the TC-based approach offers the best balance of performance, flexibility, and future-proofing. The performance is already orders of magnitude better than iptables, and it doesn't close the door to implementing more advanced, L7-aware features in the future. XDP shines in more specialized, high-volume filtering roles, such as node-level DDoS protection on the physical NIC (where native mode is available) or for extremely latency-sensitive workloads with simple policies.