Leveraging eBPF for Kernel-Level Runtime Security in Kubernetes

Goh Ling Yong

The Observability Gap in Ephemeral Infrastructure

In modern Kubernetes environments, traditional security paradigms fall short. Static container scanning, network policies, and admission controllers are necessary but insufficient layers of defense. They operate before runtime and lack visibility into the actual behavior of a process once it's running. What happens when a zero-day vulnerability in a web server allows an attacker to spawn a shell? Your static analysis is useless, and your network policies might not detect the command-and-control traffic if it tunnels over HTTP/S. This is the runtime security problem, and its solution lies directly within the Linux kernel.

Enter eBPF (extended Berkeley Packet Filter). eBPF is a revolutionary technology that allows us to run sandboxed programs within the kernel without changing kernel source code or loading kernel modules. For security engineers, this is the holy grail: the ability to observe and even control system behavior at the most fundamental level, with minimal performance overhead.

This article is not an introduction to eBPF. It assumes you understand what eBPF is and why it's powerful. Instead, we will focus on a critical, production-level implementation pattern: building a targeted, K8s-aware syscall monitor that uses efficient kernel-side filtering to minimize data overhead. We will build a tool that can detect when a process inside a specific, labeled pod executes a new program (execve syscall), and we'll do it in a way that is scalable and performant enough for a production cluster.


Section 1: Architecture and Toolchain for Production eBPF

Before writing code, we must architect our solution and choose the right tools. A naive eBPF tool might attach to a syscall and dump every event to user space for filtering. This is untenable in production; a single execve tracepoint on a busy node can generate thousands of events per second from legitimate system activity, creating massive CPU and I/O load.

Our architecture will be smarter:

  • User-space Controller (Go): This component will watch the Kubernetes API for pods with a specific label (e.g., security.monitor=true).
  • Kernel-side Filter (eBPF Map): When our controller identifies a target pod, it will determine the pod's cgroup ID and write it into a BPF_MAP_TYPE_HASH.
  • eBPF Program (C): An eBPF program attached to the execve syscall tracepoint will fire on every execution. Crucially, its first action will be to check if the process's cgroup ID exists in our hash map. If not, it exits immediately. This is our kernel-side filter.
  • Data Pipeline (eBPF Ring Buffer): If the cgroup ID matches, the eBPF program gathers process and syscall metadata and pushes it to a BPF_MAP_TYPE_RINGBUF, a modern, high-performance, and lossless mechanism for kernel-to-user-space communication.
  • User-space Agent (Go): The same Go application will read from the ring buffer, receive the filtered events, enrich them with Kubernetes metadata (pod name, namespace), and log them as security alerts.

Toolchain: `libbpf-bootstrap` over BCC

    For production eBPF, the choice between the BPF Compiler Collection (BCC) and a libbpf-based approach is critical. BCC is excellent for ad-hoc debugging and scripting, as it embeds a C-to-BPF compiler. However, this means you must ship the entire LLVM/Clang toolchain to your production nodes, a significant dependency and potential attack surface.

    We will use the libbpf-bootstrap model. This approach leverages CO-RE (Compile Once – Run Everywhere). We compile our eBPF C code into a compact object file on a build machine. This object file contains BPF bytecode and relocation information derived from BTF (BPF Type Format), which describes kernel types. The libbpf library on the target node uses this information to adapt the program to the running kernel's specific memory layouts and offsets. This results in a small, self-contained binary with no runtime compilation dependencies.
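
    To make the relocation mechanism concrete, here is a brief, illustrative sketch (not part of our monitor) of a CO-RE read. BPF_CORE_READ from <bpf/bpf_core_read.h> records each field access as a BTF relocation that libbpf resolves against the running kernel at load time:

    c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    // Illustrative helper: read the parent process's tgid. The offsets of
    // real_parent and tgid are resolved from the target kernel's BTF when
    // libbpf loads the program, not at compile time.
    static __always_inline __u32 parent_tgid(void)
    {
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        return BPF_CORE_READ(task, real_parent, tgid);
    }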

    Our project structure will look like this:

    text
    /ebpf-security-monitor
    |-- go.mod
    |-- main.go              # User-space Go controller/agent
    |-- bpf/
    |   |-- monitor.bpf.c    # The eBPF C code
    |   |-- monitor.bpf.h    # Shared structs between C and Go
    |   |-- vmlinux.h        # Kernel type definitions for CO-RE
    |-- Makefile             # To compile the eBPF C code

Section 2: Implementing the eBPF Kernel Program

    Let's write the heart of our monitor: the eBPF C code. This code will be compiled into an ELF object file and loaded by our Go application.

    First, we need a header file (bpf/monitor.bpf.h) to define the data structure for events we send to user space. Sharing this definition ensures type safety between the kernel and user space.

    bpf/monitor.bpf.h

    c
    #ifndef __MONITOR_BPF_H
    #define __MONITOR_BPF_H
    
    #define TASK_COMM_LEN 16
    #define FILENAME_LEN 256
    
    struct event {
        __u32 pid;
        __u64 cgroup_id;
        char comm[TASK_COMM_LEN];
        char filename[FILENAME_LEN];
        int retval;
    };
    
    #endif /* __MONITOR_BPF_H */

    Now for the main eBPF program. This file defines our maps and the tracepoint logic.

    bpf/monitor.bpf.c

    c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include "monitor.bpf.h"
    
    // Ring buffer for sending events to user space
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Hash map for filtering by cgroup ID. User-space populates this.
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u64);
        __type(value, __u8);
    } monitored_cgroups SEC(".maps");
    
    // This tracepoint is more reliable than a kprobe on do_execve
    // as its API is stable across kernel versions.
    SEC("tracepoint/syscalls/sys_enter_execve")
    int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
        // 1. Get cgroup ID for the current process
        __u64 cgroup_id = bpf_get_current_cgroup_id();
    
        // 2. Perform the kernel-side filter check
        // bpf_map_lookup_elem is a highly efficient hash table lookup.
        void *is_monitored = bpf_map_lookup_elem(&monitored_cgroups, &cgroup_id);
        if (!is_monitored) {
            return 0; // Not a monitored cgroup, exit immediately.
        }
    
        // 3. Reserve space on the ring buffer for our event
        struct event *e;
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (!e) {
            return 0; // Failed to reserve space, maybe buffer is full.
        }
    
        // 4. Populate the event with data
        __u64 id = bpf_get_current_pid_tgid();
        e->pid = id >> 32; // upper 32 bits of pid_tgid hold the tgid (the user-visible PID)
        e->cgroup_id = cgroup_id;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
        // 5. Read the filename argument from user-space memory
        // This is a tricky and potentially unsafe operation if not done carefully.
        const char *filename_ptr = (const char *)ctx->args[0];
        long ret = bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
        if (ret < 0) {
            // Error reading the string (e.g., an invalid pointer). Discard the
            // reservation so a corrupted event never reaches user space.
            bpf_ringbuf_discard(e, 0);
            return 0;
        }
    
        // 6. Submit the event to user space
        bpf_ringbuf_submit(e, 0);
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Decisions in the eBPF Code:

    * tracepoint/syscalls/sys_enter_execve: We use a tracepoint instead of a kprobe. Tracepoints are stable API points in the kernel, making our program more robust across kernel updates. Kprobes attach to function names which can change.

    * bpf_map_lookup_elem: This is the core of our filtering logic. The check happens entirely in the kernel. If the cgroup ID of the process executing execve is not in our monitored_cgroups map, the program exits, having consumed negligible CPU cycles.

    * bpf_ringbuf_reserve/submit: We use a ring buffer. Unlike the older perf buffer, it guarantees event order, prevents data overwriting (events are dropped if the buffer is full, but existing ones are safe), and is generally more efficient for high-throughput event streaming.

    * bpf_probe_read_user_str Error Handling: Reading from user-space memory is a privileged operation that can fail (e.g., on a page fault). We *must* check the return value. Here, we discard the event if the read fails, preventing corrupted data from reaching our agent. A more advanced implementation might send an error event instead; a sketch of that alternative follows.
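
    For illustration, a minimal sketch of that alternative (same struct event as above; only the error path changes) could look like this:

    c
    // Sketch: keep the event when the user-space read fails, reusing the
    // otherwise-unused retval field to carry the error code and an empty
    // filename to flag the failure to the agent.
    long ret = bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
    if (ret < 0) {
        e->filename[0] = '\0';  // unknown filename
        e->retval = (int)ret;   // negative error code from the failed read
    } else {
        e->retval = 0;
    }
    bpf_ringbuf_submit(e, 0);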

    To compile this, we need a Makefile and to generate vmlinux.h.

    Makefile

    makefile
    .PHONY: all clean
    
    # BPF object file
    BPF_OBJ = bpf/monitor.bpf.o
    
    # User-space binary
    GO_BIN = ebpf-monitor
    
    # Tools
    CLANG = clang
    GO = go
    BPFTOOL = bpftool
    
    all: $(GO_BIN)
    
    # Compile Go binary
    $(GO_BIN): main.go $(BPF_OBJ)
    	$(GO) build -o $(GO_BIN) main.go
    
    # Generate skeletons and compile BPF C code
    $(BPF_OBJ): bpf/monitor.bpf.c bpf/vmlinux.h
    	$(CLANG) -g -O2 -target bpf -c bpf/monitor.bpf.c -o $(BPF_OBJ)
    
    # Generate vmlinux.h for CO-RE from the running kernel's BTF.
    # If the build kernel does not expose /sys/kernel/btf/vmlinux, a pre-generated
    # BTF file from the btfhub-archive project can be fed to the same bpftool command.
    bpf/vmlinux.h:
    	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > bpf/vmlinux.h
    
    clean:
    	rm -f $(GO_BIN) $(BPF_OBJ)

    Note: The Makefile generates vmlinux.h from the build machine's kernel BTF via bpftool. If that kernel does not expose /sys/kernel/btf/vmlinux, a pre-generated BTF file for the target kernel (for example from the btfhub-archive project) can be passed to the same bpftool command instead.


Section 3: The User-space Go Controller and Agent

    Now we'll build the Go application that orchestrates the entire process. It will use the cilium/ebpf library for interacting with the eBPF subsystem and the official Kubernetes Go client.

    main.go

    go
    package main
    
    import (
    	"bytes"
    	"context"
    	"encoding/binary"
    	"errors"
    	"fmt"
    	"log"
    	"os"
    	"os/signal"
    	"strings"
    	"syscall"
    
    	"github.com/cilium/ebpf"
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/ringbuf"
    	"k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall" -type event bpf ./bpf/monitor.bpf.c -- -I./bpf
    
    const (
    	MONITOR_LABEL = "security.monitor"
    )
    
    func main() {
    	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    	defer stop()
    
    	// Load the compiled eBPF objects
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading bpf objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach the tracepoint
    	tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
    	if err != nil {
    		log.Fatalf("attaching tracepoint: %v", err)
    	}
    	defer tp.Close()
    
    	log.Println("eBPF programs loaded and attached.")
    
    	// Set up Kubernetes client
    	k8sClient, err := newKubernetesClient()
    	if err != nil {
    		log.Fatalf("creating k8s client: %v", err)
    	}
    
    	// Start the pod monitor goroutine
    	go monitorPods(ctx, k8sClient, objs.MonitoredCgroups)
    
    	// Set up the ring buffer reader
    	rd, err := ringbuf.NewReader(objs.Rb)
    	if err != nil {
    		log.Fatalf("creating ringbuf reader: %v", err)
    	}
    	defer rd.Close()
    
    	// Start the event reading loop
    	log.Println("Waiting for events...")
    
    	var event bpfEvent
    	for {
    		if ctx.Err() != nil {
    			return
    		}
    
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, ringbuf.ErrClosed) {
    				log.Println("Received signal, exiting...")
    				return
    			}
    			log.Printf("reading from reader: %s", err)
    			continue
    		}
    
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing ringbuf event: %s", err)
    			continue
    		}
    
    		comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
    		filename := string(event.Filename[:bytes.IndexByte(event.Filename[:], 0)])
    
    		log.Printf("ALERT: Process '%s' (PID: %d) executed '%s' in a monitored pod (cgroup: %d)", comm, event.Pid, filename, event.CgroupId)
    	}
    }
    
    func newKubernetesClient() (*kubernetes.Clientset, error) {
    	config, err := rest.InClusterConfig()
    	if err != nil {
    		return nil, fmt.Errorf("getting in-cluster config: %w", err)
    	}
    	clientset, err := kubernetes.NewForConfig(config)
    	if err != nil {
    		return nil, fmt.Errorf("creating clientset: %w", err)
    	}
    	return clientset, nil
    }
    
    // monitorPods watches for pods with the security.monitor label and updates the eBPF map.
    func monitorPods(ctx context.Context, clientset *kubernetes.Clientset, cgroupsMap *ebpf.Map) {
    	watcher, err := clientset.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{LabelSelector: MONITOR_LABEL + "=true"})
    	if err != nil {
    		log.Fatalf("failed to watch pods: %v", err)
    	}
    
    	log.Println("Watching for pods with label 'security.monitor=true'")
    
    	for event := range watcher.ResultChan() {
    		pod, ok := event.Object.(*v1.Pod)
    		if !ok {
    			continue
    		}
    
    		cgroupID, err := getCgroupIDForPod(pod)
    		if err != nil {
    			log.Printf("could not get cgroup id for pod %s/%s: %v", pod.Namespace, pod.Name, err)
    			continue
    		}
    
    		key := cgroupID
    		value := uint8(1)
    
    		switch event.Type {
    		case "ADDED", "MODIFIED":
    			if err := cgroupsMap.Put(&key, &value); err != nil {
    				log.Printf("failed to update cgroups map for pod %s: %v", pod.Name, err)
    			}
    			log.Printf("Started monitoring pod: %s/%s (cgroup ID: %d)", pod.Namespace, pod.Name, cgroupID)
    		case "DELETED":
    			if err := cgroupsMap.Delete(&key); err != nil {
    				log.Printf("failed to delete from cgroups map for pod %s: %v", pod.Name, err)
    			}
    			log.Printf("Stopped monitoring pod: %s/%s (cgroup ID: %d)", pod.Namespace, pod.Name, cgroupID)
    		}
    	}
    }
    
    // getCgroupIDForPod is a complex and fragile part of the system.
    // This is a simplified example for cgroup v2 and a specific CRI layout.
    // Production systems need to handle cgroup v1 and different CRI runtimes.
    func getCgroupIDForPod(pod *v1.Pod) (uint64, error) {
    	if len(pod.Status.ContainerStatuses) == 0 {
    		return 0, fmt.Errorf("pod has no container statuses")
    	}
    	containerID := pod.Status.ContainerStatuses[0].ContainerID
    	if containerID == "" {
    		return 0, fmt.Errorf("container ID is empty")
    	}
    
    	// Example format: containerd://<id>
    	parts := strings.Split(containerID, "//")
    	if len(parts) != 2 {
    		return 0, fmt.Errorf("malformed containerID: %s", containerID)
    	}
    	fullID := parts[1] // cgroupfs directory names use the full container ID
    
    	// This is a massive simplification. In reality, you'd walk the cgroupfs.
    	// Here we just construct an expected path and stat it for its inode number.
    	// This path is highly specific to this environment setup.
    	podUID := strings.ReplaceAll(string(pod.UID), "-", "_")
    	path := fmt.Sprintf("/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod%s.slice/cri-containerd-%s.scope", podUID, fullID)
    
        // Open the cgroup directory and fstat it to read its inode number.
        f, err := os.Open(path)
        if err != nil {
            // Fallback for different QoS classes
            path = fmt.Sprintf("/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod%s.slice/cri-containerd-%s.scope", podUID, fullID)
            f, err = os.Open(path)
            if err != nil {
                return 0, fmt.Errorf("cannot find cgroup path for container %s", fullID)
            }
        }
        defer f.Close()
    
        var stat syscall.Stat_t
        if err := syscall.Fstat(int(f.Fd()), &stat); err != nil {
            return 0, fmt.Errorf("fstat failed: %w", err)
        }
    
        // On cgroup2, the inode number of the cgroup directory is often used as the ID.
        // A more robust way involves using file handles, but this is a common method.
        return stat.Ino, nil
    }
    

    Analysis of the Go Controller:

    * bpf2go: This command from the cilium/ebpf library is crucial. It processes our C file, generates Go files (e.g., bpf_bpfel.go) containing the compiled BPF bytecode, and creates Go structs that map directly to our BPF maps, programs, and (via -type event) our shared event struct. This is the glue that enables CO-RE.

    * monitorPods Goroutine: This is our K8s controller. It uses a Watch on pods with the security.monitor=true label. When a pod is added, it calls getCgroupIDForPod and updates the monitored_cgroups eBPF map. When a pod is deleted, it removes the entry. This ensures our kernel filter is always synchronized with the desired state in Kubernetes.

    * The getCgroupIDForPod Problem: This function highlights a major real-world challenge. There is no simple, direct way to map a Kubernetes Pod to a cgroup ID. The implementation here is a fragile heuristic based on parsing cgroupfs paths, which differ between cgroup v1/v2, CRI runtimes (containerd, cri-o), and Kubernetes QoS classes. A production-grade system like Falco has extensive, complex logic to handle this robustly. We use a simplified fstat approach to get the inode number, which often serves as the cgroup ID in modern kernels; a file-handle-based alternative is sketched after this list.

    * Event Loop: The main function's primary loop blocks on rd.Read(). When an event arrives from the kernel (because it passed our cgroup filter), we parse it into our shared struct and log an alert. This is where you would integrate with systems like Prometheus, Fluentd, or a SIEM.
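
    As referenced in the getCgroupIDForPod point above, here is a hedged sketch of the file-handle approach in plain user-space C (the Go agent would use the equivalent unix.NameToHandleAt from golang.org/x/sys/unix). On cgroup v2, the handle returned by name_to_handle_at(2) encodes the same 64-bit ID that bpf_get_current_cgroup_id() reports, which is more explicit than relying on the inode number:

    c
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>

    // Sketch: resolve a cgroup v2 directory to its 64-bit cgroup ID.
    static int cgroup_id_from_path(const char *path, uint64_t *id)
    {
        union {
            struct file_handle fh;
            char storage[sizeof(struct file_handle) + sizeof(uint64_t)];
        } h;
        int mount_id;

        memset(&h, 0, sizeof(h));
        h.fh.handle_bytes = sizeof(uint64_t);
        if (name_to_handle_at(AT_FDCWD, path, &h.fh, &mount_id, 0) < 0)
            return -1;              // path missing or handles unsupported

        memcpy(id, h.fh.f_handle, sizeof(uint64_t));
        return 0;
    }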


Section 4: Edge Cases and Production Pitfalls

    Deploying eBPF at scale requires navigating a landscape of subtle technical challenges.

  • The Bounded Loop and Verifier Complexity: The eBPF verifier statically analyzes your program before loading it to ensure it is safe (e.g., that it terminates and never accesses arbitrary memory). This means no unbounded loops: any loop must have a constant, known upper bound. This constraint forces you to design algorithms differently; for example, you can't iterate over a string of unknown length in the kernel, and must rely on bounded helpers like bpf_probe_read_user_str that copy at most a fixed number of bytes. A bounded-loop sketch follows this list.
  • Kernel Version Skew and CO-RE Limitations: While CO-RE is powerful, it's not magic. If you rely on a kernel feature or struct field that doesn't exist in an older kernel, your program will fail to load. Production fleets often span a wide range of kernel versions. Robust eBPF applications use conditional logic (e.g., if (bpf_core_field_exists(...))) and ship multiple eBPF programs, selecting the right one at runtime based on kernel capabilities. A minimal bpf_core_field_exists sketch follows this list.
  • TOCTOU (Time-of-check to time-of-use) Attacks: Our sys_enter_execve probe has a subtle vulnerability. We read the filename argument from the process's memory. An attacker could theoretically change the contents of that memory after our eBPF probe reads it but before the kernel actually executes the syscall. The definitive solution is to use Linux Security Module (LSM) eBPF hooks (e.g., the bprm_check_security hook), which are called later in the execution path and operate on the kernel's trusted copy of the arguments. LSM hooks can also block the syscall outright, moving from detection to prevention; a minimal LSM sketch follows this list.
  • Performance Overhead and Benchmarking: While eBPF is low-overhead, it's not zero-overhead. Attaching to a high-frequency syscall like read or write with complex logic can impact performance. It is critical to benchmark. Use tools like bpftool prog profile to measure the CPU cycles consumed by your eBPF program under load. Ensure your maps are correctly sized; a full map will cause insert failures, and a hash map with many collisions will degrade lookup performance.
  • Container Runtimes and PID Namespaces: A PID seen by the kernel (in our eBPF program) is not the same as the PID inside a container. Our current implementation correctly operates on the host PID. When enriching data, if you need to correlate with logs from inside a container, you must have a mechanism to translate between host PIDs and container-namespaced PIDs.
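
    The bounded-loop sketch referenced above. The helper name, MAX_ARGS, and the per-argument handling are illustrative; the shape that matters is the constant loop bound the verifier can reason about:

    c
    #define MAX_ARGS 8

    // Sketch: walk the execve argv array with a verifier-friendly constant bound.
    static __always_inline void read_argv(struct trace_event_raw_sys_enter *ctx)
    {
        const char *const *argv = (const char *const *)ctx->args[1];

        for (int i = 0; i < MAX_ARGS; i++) {   // constant bound: termination is provable
            const char *argp = NULL;
            if (bpf_probe_read_user(&argp, sizeof(argp), &argv[i]) || !argp)
                break;                         // end of argv or a faulting read
            // Each string would then be copied with bpf_probe_read_user_str()
            // into a bounded buffer (e.g., a per-CPU array map entry).
        }
    }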
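
    The kernel-capability guard mentioned in the CO-RE point, as a minimal sketch. The field used here (task_struct.start_boottime) is only an example of a member that may be absent on older kernels:

    c
    #include <bpf/bpf_core_read.h>

    // Sketch: bpf_core_field_exists() is resolved at load time against the
    // target kernel's BTF, so the missing-field branch is dead code there
    // and the program still loads.
    static __always_inline __u64 boot_start_ns(struct task_struct *task)
    {
        if (bpf_core_field_exists(task->start_boottime))
            return BPF_CORE_READ(task, start_boottime);
        return 0; // field not present on this kernel: skip the enrichment
    }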
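
    And the LSM variant referenced in the TOCTOU point, as a hedged sketch (program name and logging are illustrative). It reuses the monitored_cgroups map and FILENAME_LEN from earlier, reads the kernel's own copy of the path, and requires a kernel with CONFIG_BPF_LSM and the bpf LSM enabled at boot:

    c
    // Sketch: an eBPF LSM hook sees the kernel's trusted copy of the exec
    // arguments, closing the user-space TOCTOU window, and can deny the call.
    SEC("lsm/bprm_check_security")
    int BPF_PROG(restrict_exec, struct linux_binprm *bprm)
    {
        __u64 cgroup_id = bpf_get_current_cgroup_id();

        if (!bpf_map_lookup_elem(&monitored_cgroups, &cgroup_id))
            return 0;                          // not a monitored pod

        char path[FILENAME_LEN];
        bpf_probe_read_kernel_str(path, sizeof(path), BPF_CORE_READ(bprm, filename));
        bpf_printk("exec in monitored cgroup: %s", path);

        // Returning -EPERM here instead of 0 would block the execution,
        // turning detection into prevention.
        return 0;
    }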

Conclusion: The Future of Cloud-Native Security

    We have successfully designed and implemented a sophisticated, Kubernetes-aware runtime security monitor. We didn't just dump events; we built an intelligent filtering system where the Kubernetes control plane programs the Linux kernel's security policy in real-time. This pattern of using eBPF maps as a dynamic, kernel-level filter, controlled by a user-space agent that understands application-level context (like Kubernetes labels), is a foundational technique for building scalable and efficient cloud-native security and observability tools.

    By moving logic from user space into the kernel, we drastically reduced data volume and system load. We've seen how to use CO-RE for portable, production-ready deployments and acknowledged the complex realities of interacting with kernel and container abstractions. This is the direction the industry is heading. Projects like Cilium (for networking and security), Tetragon (for security observability), and Falco (for runtime security) are all built on these advanced eBPF principles. Mastering them is no longer optional for senior engineers working at the intersection of infrastructure, security, and performance.
