eBPF-based Runtime Security Monitoring for Containerized Workloads

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Observability Gap in Ephemeral, Containerized Environments

As senior engineers, we've witnessed the evolution from monolithic applications on bare metal to microservices running in ephemeral containers orchestrated by Kubernetes. While this shift has brought immense scalability and agility, it has fundamentally broken traditional host-based security models. Tools like auditd, the de-facto standard for Linux auditing, struggle in this new paradigm.

Why do they fall short?

  • High CPU Overhead: The kernel audit subsystem builds a record for every matching syscall and delivers it to the auditd daemon over a netlink socket. In a high-density node running hundreds of containers, each generating thousands of syscalls per second, the cost of formatting, queueing, and processing those records becomes untenable. It's not uncommon to see auditd consuming 10-20% of a CPU core under moderate load, a cost that's simply too high in production.
  • Lack of Container Context: An event from auditd might report that PID 12345 executed /bin/sh. But in a Kubernetes world, this is insufficient. We need to know: Which container did this happen in? Which pod? What's the container image? Which Kubernetes namespace? Correlating a host-level PID with this rich container metadata is a slow, complex, and often racy process involving user-space lookups against the container runtime API.
  • Event Filtering Inefficiency: auditd rules are matched in the kernel, but the rule language is limited to syscall numbers, argument values, and a fixed set of fields such as UIDs and paths. It has no notion of containers, so any container-aware filtering happens in user-space after the event has already been generated and sent from the kernel. You pay the performance price even for events you ultimately discard.

    This is where eBPF (extended Berkeley Packet Filter) provides a fundamentally different approach. It allows us to run sandboxed, event-driven programs directly within the Linux kernel. For runtime security, this is a game-changer. We can attach eBPF programs to various kernel hooks (syscalls, network functions, tracepoints) to observe system behavior with low overhead. The filtering and initial data aggregation happen in the kernel, meaning only relevant, context-rich security signals are passed to user-space. This is the fundamental architectural shift from "collect everything and filter later" to "intelligent, in-kernel filtering first."

    This article is not an introduction to eBPF. It assumes you understand the basics of probes, maps, and the verifier. We will dive straight into two production-grade patterns for building a container-aware runtime security monitor.

    Core eBPF Primitives for Advanced Security Monitoring

    Before we build, let's refine our understanding of the tools. Choosing the right eBPF primitives is critical for performance and stability.

    Kprobes vs. Tracepoints: The Stability vs. Flexibility Trade-off

  • Tracepoints (tracepoint/syscalls/sys_enter_*): These are stable, well-defined hooks in the kernel source. Kernel developers treat tracepoint formats as a stable interface, and the syscall tracepoints mirror the syscall ABI itself, so they are unlikely to change between kernel versions. For syscall auditing, using the sys_enter and sys_exit tracepoints is the most robust approach. You get a stable set of arguments, and your eBPF program is less likely to break on a kernel upgrade.
  • Kprobes (kprobe/__x64_sys_execve): These allow you to attach to any kernel function, offering immense flexibility. However, this comes at a cost. Kernel function names and signatures can change without notice, even in patch releases. Relying on kprobes for core functionality can lead to brittle tooling. Their primary advantage is accessing function arguments that might not be exposed by a tracepoint. For security, a common pattern is to use tracepoints for broad syscall monitoring and reserve kprobes for deep inspection of specific, high-risk functions where the tracepoint data is insufficient.
  • Production Guideline: Start with tracepoints for syscalls. Only use kprobes when you absolutely need data not available in the tracepoint context and be prepared to handle breakages on kernel updates, often by providing multiple probe definitions for different kernel versions.

    Choosing the Right Map for the Job

    Beyond simple hash maps, specific eBPF map types are crucial for security use cases:

  • BPF_MAP_TYPE_RINGBUF: The modern standard (Linux 5.8+) for high-throughput event streaming from kernel to user-space. Unlike its predecessor, BPF_MAP_TYPE_PERF_EVENT_ARRAY, a single ring buffer is shared by all CPUs, which uses memory more efficiently and preserves event ordering. It does not make event loss impossible: when the buffer is full, bpf_ringbuf_reserve returns NULL and the event is dropped, but the eBPF program sees the failure and can count it. It uses a producer/consumer model over a shared memory region, minimizing kernel/user-space boundary crossings. For sending security alerts, this is the superior choice.
  • BPF_MAP_TYPE_LPM_TRIE (Longest Prefix Match Trie): Essential for network policy enforcement. This map type is optimized for matching IP addresses against CIDR blocks. You can populate it from user-space with a set of allowed or denied IP prefixes, and the kernel-side eBPF program can perform lookups with extreme efficiency. We'll use this in our network monitoring pattern.
  • BPF_MAP_TYPE_PERCPU_*: For high-frequency event aggregation. If you want to count, for example, how many times a specific syscall is called per container, updating a single shared hash map entry forces every CPU to contend for the same memory, whether through atomic operations or a bpf_spin_lock. A PERCPU_HASH or PERCPU_ARRAY gives each CPU its own copy of every value. Updates need no synchronization and are very fast. User-space can then read the per-CPU copies and sum them for a final count.

    CO-RE: The Key to Portable Production Deployments
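
    The user-space half of per-CPU aggregation is a simple fold. The sketch below assumes the agent has already read a snapshot of a per-CPU map into a Go map holding one slot per CPU for each key (user-space libraries such as cilium/ebpf return per-CPU values as a slice). The names sumPerCPU and aggregate, and the sample data, are hypothetical.

```go
package main

import "fmt"

// sumPerCPU adds up the per-CPU slots of a single map entry.
func sumPerCPU(perCPU []uint64) uint64 {
	var total uint64
	for _, v := range perCPU {
		total += v
	}
	return total
}

// aggregate collapses a snapshot of a per-CPU map (key -> one value per CPU)
// into a single total per key.
func aggregate(snapshot map[uint64][]uint64) map[uint64]uint64 {
	totals := make(map[uint64]uint64, len(snapshot))
	for key, perCPU := range snapshot {
		totals[key] = sumPerCPU(perCPU)
	}
	return totals
}

func main() {
	// Hypothetical snapshot: cgroup ID -> execve count seen on each of 4 CPUs.
	snapshot := map[uint64][]uint64{
		4242: {3, 0, 7, 1},
		9001: {0, 0, 2, 0},
	}
	totals := aggregate(snapshot)
	fmt.Println(totals[4242], totals[9001]) // prints "11 2"
}
```

    Because the kernel keeps updating the per-CPU slots while user-space reads them, the totals are a point-in-time approximation, which is usually acceptable for counters.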

    CO-RE (Compile Once – Run Everywhere) is non-negotiable for any serious eBPF project. It solves the problem of kernel struct layout changes across different versions. Instead of hardcoding struct member offsets, CO-RE relies on BTF (BPF Type Format) type information, which most modern Linux distributions now ship with their kernels. The compiler records a relocation for each struct field access, and your user-space loader (like libbpf) resolves those relocations at load time against the running kernel's BTF so that every memory access matches its layout.

    This means you can compile your eBPF C code once on a developer machine and distribute that single binary, which will run correctly on a wide range of production kernel versions without modification. This eliminates the nightmare of needing a custom-compiled eBPF program for every target kernel.


    Production Pattern 1: Container-Aware Real-time Syscall Auditing

    Goal: Detect when a shell is spawned (execve syscall) inside any container and send a detailed alert to user-space, including the container's cgroup ID, PID, and the full command.

    The eBPF Kernel-Space Program (`runtime_monitor.bpf.c`)

    This C program will be compiled into an eBPF object file. It defines the map for communication and the eBPF program logic that attaches to the execve syscall.

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    #define TASK_COMM_LEN 16
    #define MAX_ARGS 10
    #define ARG_SIZE 128
    
    // Event structure sent to user-space
    struct exec_evt {
        u64 cgroup_id;
        u32 pid;
        char comm[TASK_COMM_LEN];
        char args[MAX_ARGS][ARG_SIZE];
        int retval;
    };
    
    // Ring buffer for sending events
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Force emitting struct exec_evt into the ELF so libbpf can read it.
    const struct exec_evt *unused __attribute__((unused));
    
    // Attach to the exit point of execve to get the return value
    SEC("tracepoint/syscalls/sys_exit_execve")
    int tracepoint__syscalls__sys_exit_execve(struct trace_event_raw_sys_exit *ctx) {
        struct exec_evt *evt;
        u64 id = bpf_get_current_pid_tgid();
        u32 pid = id >> 32; // upper 32 bits hold the TGID, the PID user-space sees
        int ret = ctx->ret;
    
        // We only care about successful execve calls for this example
        if (ret != 0) {
            return 0;
        }
    
        evt = bpf_ringbuf_reserve(&rb, sizeof(*evt), 0);
        if (!evt) {
            return 0; // buffer full: this event is dropped
        }
    
        evt->pid = pid;
        evt->retval = ret;
        evt->cgroup_id = bpf_get_current_cgroup_id();
        bpf_get_current_comm(&evt->comm, sizeof(evt->comm));
    
        // This is a simplified argument parsing. A production agent would be more robust.
        // The sys_exit context carries only the syscall number and return value, and
        // the caller's argv pointers are gone once execve succeeds. Instead we read
        // the new image's argument block: the kernel has already copied argv into
        // the memory between mm->arg_start and mm->arg_end as NUL-separated strings.
        // Another approach is to capture argv at sys_enter, store it in a map, and
        // retrieve it at sys_exit.
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        unsigned long arg_start = BPF_CORE_READ(task, mm, arg_start);
        unsigned long arg_end = BPF_CORE_READ(task, mm, arg_end);
    
        #pragma unroll
        for (int i = 0; i < MAX_ARGS; i++) {
            // Ring buffer memory is not zeroed, so mark every slot empty first.
            evt->args[i][0] = '\0';
            if (arg_start >= arg_end) {
                continue;
            }
            // Returns the number of bytes copied including the NUL, or a negative error.
            // An argument longer than ARG_SIZE - 1 is truncated and its remainder
            // lands in the next slot.
            long n = bpf_probe_read_user_str(evt->args[i], ARG_SIZE, (const void *)arg_start);
            if (n <= 0) {
                arg_start = arg_end; // stop reading after an error
                continue;
            }
            arg_start += n;
        }
    
        bpf_ringbuf_submit(evt, 0);
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Implementation Details:

  • vmlinux.h: This header is generated by bpftool from the kernel's BTF and contains the kernel's type definitions for your architecture. Combined with CO-RE relocations, it lets you reference kernel structs directly instead of shipping kernel headers.
  • BPF_MAP_TYPE_RINGBUF: We define a 256KB ring buffer. If the user-space agent can't consume events fast enough, the buffer fills up, bpf_ringbuf_reserve returns NULL, and new events are dropped until space is freed. Existing events are never overwritten, and kernel memory use stays bounded.
  • tracepoint/syscalls/sys_exit_execve: We attach to the exit of the syscall. This is crucial because it allows us to see the return value (ctx->ret). We can filter for only successful executions (ret == 0), significantly reducing noise.
  • Context Enrichment in Kernel: We immediately capture the cgroup_id using bpf_get_current_cgroup_id(). This is the most efficient way to get a handle on the container context. The ID refers to the task's cgroup in the cgroup v2 hierarchy, so the node must run cgroup v2 for it to identify a container. User-space will later map this ID to a container name.
  • Argument Parsing: Reading strings from user-space memory inside the kernel is delicate. We use bpf_probe_read_user_str, which copies a NUL-terminated string and returns an error instead of faulting when the memory is unreadable. The #pragma unroll asks the compiler to unroll the loop. Kernels before 5.3 rejected every loop, so unrolling was mandatory; newer verifiers accept loops whose bound they can prove, and a constant bound such as MAX_ARGS keeps that proof trivial.

    The User-Space Agent (`main.go`)

    This Go program uses the cilium/ebpf library to load, attach, and listen for events from our eBPF program.

    go
    package main
    
    import (
    	"bytes"
    	"encoding/binary"
    	"errors"
    	"fmt"
    	"log"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/ringbuf"
    	"golang.org/x/sys/unix"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf runtime_monitor.bpf.c -- -I./headers
    
    const (
    	taskCommLen = 16
    	maxArgs     = 10
    	argSize     = 128
    )
    
    type execEvt struct {
    	CgroupID uint64
    	PID      uint32
    	Comm     [taskCommLen]byte
    	Args     [maxArgs][argSize]byte
    	Retval   int32
    }
    
    func main() {
    	stopper := make(chan os.Signal, 1)
    	signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
    	// Load pre-compiled programs and maps into the kernel.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach the tracepoint
    	tp, err := link.Tracepoint("syscalls", "sys_exit_execve", objs.TracepointSyscallsSysExitExecve, nil)
    	if err != nil {
    		log.Fatalf("attaching tracepoint: %v", err)
    	}
    	defer tp.Close()
    
    	log.Println("Successfully loaded and attached eBPF program. Waiting for events...")
    
    	// Open a ringbuf reader from user-space POV.
    	rd, err := ringbuf.NewReader(objs.Rb)
    	if err != nil {
    		log.Fatalf("opening ringbuf reader: %s", err)
    	}
    	defer rd.Close()
    
    	go func() {
    		<-stopper
    		log.Println("Received signal, exiting...")
    		rd.Close()
    	}()
    
    	var event execEvt
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, ringbuf.ErrClosed) {
    				log.Println("Ring buffer closed.")
    				return
    			}
    			log.Printf("error reading from ring buffer: %s", err)
    			continue
    		}
    
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing ringbuf event: %s", err)
    			continue
    		}
    
    		// In a real agent, you would now map CgroupID to container info.
    		// For example: query a local cache populated from the Docker/containerd API.
    		containerInfo := fmt.Sprintf("cgroup_id=%d", event.CgroupID)
    
    		command := unix.ByteSliceToString(event.Comm[:])
    		// The eBPF program reports every successful execve. Keep only common
    		// shells so the alert below matches the stated goal.
    		switch command {
    		case "sh", "bash", "dash", "ash", "zsh":
    		default:
    			continue
    		}
    
    		var args []string
    		for _, arg := range event.Args {
    			if arg[0] == 0 {
    				break
    			}
    			args = append(args, unix.ByteSliceToString(arg[:]))
    		}
    
    		log.Printf("ALERT: Shell execution detected in %s. PID: %d, Command: %s, Args: %v",
    			containerInfo, event.PID, command, args)
    	}
    }

    To run this example:

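  • Install Go, Clang, LLVM, bpftool, and libbpf-dev. The ring buffer map requires Linux 5.8 or newer. On kernels older than 5.11 the agent must also raise RLIMIT_MEMLOCK before loading (cilium/ebpf provides rlimit.RemoveMemlock for this).
  • Generate vmlinux.h: mkdir -p headers && bpftool btf dump file /sys/kernel/btf/vmlinux format c > headers/vmlinux.h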
  • Run go generate to compile the C code and embed it into a Go file.
  • Run sudo go run . to start the monitor.
  • In another terminal, run a command inside a Docker container: docker run --rm -it alpine sh -c "ls -la /tmp".
  • You will see an alert logged by the Go program with the cgroup ID and the command details.


    Production Pattern 2: Network Anomaly Detection with TC and LPM Maps

    Goal: Monitor outbound TCP connections from containers and flag any connection to an IP address not on a predefined allowlist. This is a powerful way to detect command-and-control (C2) callbacks or connections to crypto-mining pools.

    We will use a Traffic Control (TC) classifier instead of a kprobe on tcp_connect. Why? TC hooks are a more natural and performant place to inspect network packets. Attaching to a veth (virtual ethernet) pair of a container allows us to precisely target a single container's traffic.

    The eBPF Kernel-Space Program (`network_monitor.bpf.c`)

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>
    
    #define ETH_P_IP 0x0800 /* Internet Protocol packet */
    #define TC_ACT_OK 0     /* macro from linux/pkt_cls.h, not present in vmlinux.h */
    
    struct policy_evt {
        u64 cgroup_id;
        u32 saddr;
        u32 daddr;
        u16 dport;
    };
    
    // LPM key for IPv4: prefix length followed by the address in network byte order.
    // struct bpf_lpm_trie_key only declares a flexible data[] member, so we define
    // a fixed-size equivalent.
    struct ipv4_lpm_key {
        u32 prefixlen;
        u32 addr;
    };
    
    // Ring buffer for alerts
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } rb SEC(".maps");
    
    // LPM Trie for IP allowlist. Key is prefixlen + IP.
    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 1024);
        __type(key, struct ipv4_lpm_key);
        __type(value, u8); // Value doesn't matter, we just check for existence
        __uint(map_flags, BPF_F_NO_PREALLOC); // mandatory for LPM tries
    } allowlist SEC(".maps");
    
    const struct policy_evt *unused __attribute__((unused));
    
    SEC("tc")
    int tc_monitor(struct __sk_buff *skb) {
        void *data_end = (void *)(long)skb->data_end;
        void *data = (void *)(long)skb->data;
        struct ethhdr *eth = data;
    
        if ((void *)eth + sizeof(*eth) > data_end) {
            return TC_ACT_OK;
        }
    
        if (eth->h_proto != bpf_htons(ETH_P_IP)) {
            return TC_ACT_OK;
        }
    
        struct iphdr *iph = data + sizeof(*eth);
        if ((void *)iph + sizeof(*iph) > data_end) {
            return TC_ACT_OK;
        }
    
        if (iph->protocol != IPPROTO_TCP) {
            return TC_ACT_OK;
        }
    
        // The IPv4 header is variable-length: ihl counts 32-bit words.
        u32 ip_hdr_len = iph->ihl * 4;
        if (ip_hdr_len < sizeof(*iph)) {
            return TC_ACT_OK;
        }
    
        struct tcphdr *tcph = (void *)iph + ip_hdr_len;
        if ((void *)tcph + sizeof(*tcph) > data_end) {
            return TC_ACT_OK;
        }
    
        // Only look at the first packet of a connection
        if (!(tcph->syn && !tcph->ack)) {
            return TC_ACT_OK;
        }
    
        // Check if destination IP is in the allowlist
        struct ipv4_lpm_key key = {
            .prefixlen = 32,
            .addr = iph->daddr, // already in network byte order
        };
    
        // If lookup succeeds, it's allowed. If it fails (returns NULL), it's a violation.
        if (bpf_map_lookup_elem(&allowlist, &key)) {
            return TC_ACT_OK;
        }
    
        // Policy violation detected! Send an event.
        struct policy_evt *evt;
        evt = bpf_ringbuf_reserve(&rb, sizeof(*evt), 0);
        if (!evt) {
            return TC_ACT_OK;
        }
    
        // TC programs run in packet context, not on behalf of a task, so take the
        // cgroup from the socket that owns the packet (0 if there is none).
        evt->cgroup_id = bpf_skb_cgroup_id(skb);
        evt->saddr = iph->saddr;
        evt->daddr = iph->daddr;
        evt->dport = bpf_ntohs(tcph->dest);
    
        bpf_ringbuf_submit(evt, 0);
    
        return TC_ACT_OK; // We are just monitoring, not dropping packets
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Implementation Details:

  • SEC("tc"): This section tells libbpf that this is a Traffic Control program.
  • Packet Parsing: The code carefully walks the packet headers, from Ethernet to IP to TCP, with bounds checks at each step to satisfy the verifier.
  • SYN Packet Filter: We only inspect packets where the SYN flag is set and the ACK flag is not. This isolates the very first packet of a TCP handshake, ensuring we only generate one alert per connection attempt.
  • BPF_MAP_TYPE_LPM_TRIE Lookup: The core of the policy check. We construct a key with a 32-bit prefix (a single IP) and the destination address. bpf_map_lookup_elem performs a longest-prefix match. If it finds an entry, the IP is allowed. If it returns NULL, the connection is suspicious.
    The User-Space Agent (Conceptual Go Snippets)

    A full Go agent for this would be more complex as it needs to manage TC qdiscs and filters. Here are the key conceptual pieces using cilium/ebpf.

    go
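    // Loading and attaching the TC program.
    // Look up the interface to monitor (e.g., using the netlink library).
    // To see a container's outbound traffic as egress, this code must run inside
    // the container's network namespace, where the interface is typically eth0.
    // From the host you would attach to the peer veth instead, where the same
    // traffic arrives as ingress.
    iface, err := net.InterfaceByName("eth0")
    if err != nil {
        log.Fatalf("looking up interface: %v", err)
    }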
    
    // Create a qdisc (queueing discipline) on the interface if it doesn't exist
    qdisc := &netlink.GenericQdisc{
        QdiscAttrs: netlink.QdiscAttrs{
            LinkIndex: iface.Index,
            Handle:    netlink.MakeHandle(0xffff, 0),
            Parent:    netlink.HANDLE_CLSACT,
        },
        QdiscType: "clsact",
    }
    if err := netlink.QdiscAdd(qdisc); err != nil {
        // handle error, it might already exist
    }
    
    // Attach the eBPF program as a filter
    filter := &netlink.BpfFilter{
        FilterAttrs: netlink.FilterAttrs{
            LinkIndex: iface.Index,
            Parent:    netlink.HANDLE_MIN_EGRESS, // Hook on egress traffic
            Protocol:  unix.ETH_P_ALL,
            Priority:  1,
        },
        Fd:           objs.TcMonitor.FD(),
        Name:         "tc_monitor_out",
        DirectAction: true,
    }
    
    if err := netlink.FilterAdd(filter); err != nil {
        log.Fatalf("cannot add tc filter: %v", err)
    }
    
    // Populating the LPM Trie map from user-space
    // Assume allowlistCIDRs is a []string{"8.8.8.8/32", "1.1.1.0/24"}
    
    for _, cidrStr := range allowlistCIDRs {
        _, ipNet, err := net.ParseCIDR(cidrStr)
        if err != nil {
            log.Printf("invalid cidr: %s", cidrStr)
            continue
        }
    
        prefixLen, _ := ipNet.Mask.Size()
        // LPM key requires IP in network byte order (big endian)
        ipBytes := ipNet.IP.To4()
        if ipBytes == nil {
            continue // Skip IPv6 for this example
        }
    
        // The key struct must match the C definition
        type lpmKey struct {
            PrefixLen uint32
            IP        [4]byte
        }
    
        key := lpmKey{
            PrefixLen: uint32(prefixLen),
        }
        copy(key.IP[:], ipBytes)
        
        // The value can be anything, we use a dummy byte
        value := byte(1)
    
        if err := objs.Allowlist.Put(key, value); err != nil {
            log.Fatalf("failed to update allowlist map: %v", err)
        }
    }
    
    // The event listening loop is similar to the first example,
    // reading from the ring buffer.

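    This pattern provides an efficient, in-kernel network monitor for your containers. As written it only observes: every path returns TC_ACT_OK. Returning TC_ACT_SHOT on a violation would turn the same program into an enforcing egress firewall.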


    Advanced Considerations and Edge Cases

    Building robust eBPF-based security tools requires navigating a minefield of performance pitfalls and verifier quirks.

    Performance: Ring Buffer vs. Perf Buffer

    For years, BPF_MAP_TYPE_PERF_EVENT_ARRAY was the standard. It works by having a per-CPU buffer that user-space memory-maps. While fast, it has drawbacks for security tooling: memory must be provisioned per CPU, so a burst on one core can overflow its buffer while the others sit idle; events from different CPUs can reach user-space out of order; and when user-space falls behind, events are lost and surface only as a lost-sample count. BPF_MAP_TYPE_RINGBUF is a multi-producer, single-consumer (MPSC) queue shared by all CPUs that addresses these problems. It provides a contiguous memory region for events. The kernel is the producer, user-space is the consumer. The ring buffer can still fill up, but when it does, bpf_ringbuf_reserve fails and you can explicitly handle that case in your eBPF code (e.g., by incrementing a drop counter). For any security alerting use case, prefer RINGBUF.

    The Verifier Gauntlet: Bounded Loops

    The eBPF verifier must prove that your program will always run to completion and not crash the kernel. One of its strictest rules is against unbounded loops. The verifier analyzes all possible paths and must be able to establish an upper bound on the number of instructions executed.

    A loop like for (int i = 0; i < var; i++) where the verifier cannot establish an upper bound for var will be rejected. Before Linux 5.3 the verifier rejected every loop, and even on newer kernels a compile-time constant bound is the most reliable way to pass. This is why our execve argument parsing loop used a constant MAX_ARGS and a #pragma unroll directive. Unrolling turns the loop into a flat sequence of instructions, making it trivial for the verifier to analyze.

    Example of a rejected loop and its fix:

    c
    // REJECTED by verifier
    int len = get_some_dynamic_length();
    for (int i = 0; i < len; i++) { ... }
    
    // ACCEPTED by verifier
    #define MAX_LEN 16
    #pragma unroll
    for (int i = 0; i < MAX_LEN; i++) {
        if (i >= len) { // Add a runtime check
            break;
        }
        ...
    }

    TOCTOU (Time-of-check to time-of-use) Vulnerabilities

    Consider our execve monitor. Any hook that reads syscall arguments from user memory sees them at a different moment than the kernel does. An attacker can have a second thread rewrite the filename buffer, or replace the file with a symlink, between our eBPF read and the kernel's own use of the argument. This is a classic TOCTOU attack.

    Mitigating this is complex. One strategy is to use both sys_enter and sys_exit tracepoints. At sys_enter, you can record the arguments and a timestamp in a per-PID map. At sys_exit, you can correlate the entry data with the exit status. This tells you what was requested and whether it succeeded, but it does not close the race. For that, you need to hook deeper in the kernel, after the arguments have been copied and resolved: LSM hooks such as bprm_check_security, the sched_process_exec tracepoint, or filesystem functions (vfs_*). Tools like Falco and Tracee have had to confront the same problem, and Tracee instruments LSM hooks for this reason. This highlights that simple syscall auditing is just the first step.

    Conclusion: The Future is In-Kernel

    eBPF is not just another tool; it's a fundamental shift in how we build observability and security systems for the cloud-native era. By moving detection logic from slow, context-unaware user-space agents into the kernel, we can build monitors that are orders of magnitude more performant and precise.

    We've demonstrated two production-ready patterns: syscall auditing and network policy monitoring. These are the building blocks of a comprehensive Cloud Native Application Protection Platform (CNAPP) or runtime security solution. The real power comes from combining these data sources—correlating a suspicious execve with an anomalous outbound network connection from the same container provides a high-fidelity security signal that is nearly impossible to achieve with traditional tools.

    While the learning curve is steep and requires deep systems knowledge, the payoff is unparalleled visibility into your production workloads with minimal performance impact. As eBPF continues to evolve, its role as the de-facto standard for kernel-level instrumentation is all but certain. Mastering it is no longer optional for senior engineers working on infrastructure and security.
