Kernel-Level Runtime Security via eBPF for Container Threat Detection

Goh Ling Yong

The Observability Gap in Container Security

In modern microservices architectures, container orchestration platforms like Kubernetes have introduced layers of abstraction that render traditional security monitoring tools partially blind. Host-based Intrusion Detection Systems (HIDS) that rely on user-space agents or log scraping struggle with the ephemeral nature of containers and the complexities of PID and network namespaces. A process ID (PID) inside a container is meaningless on the host without proper context, and network traffic analysis at the host level can be difficult to attribute to a specific pod.
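To make the attribution problem concrete: the kernel exposes each process's PID in every nested namespace via the NSpid line of /proc/&lt;pid&gt;/status (available since kernel 4.1). A small Go sketch — helper names are our own, not from any library — that recovers the in-container PID from that line:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNSpid extracts the PIDs from a "NSpid:" line of /proc/<pid>/status.
// The first value is the host PID; the last is the PID in the innermost
// (e.g. container) PID namespace.
func parseNSpid(statusLine string) (hostPID, nsPID int, err error) {
	fields := strings.Fields(strings.TrimPrefix(statusLine, "NSpid:"))
	if len(fields) == 0 {
		return 0, 0, fmt.Errorf("no NSpid values in %q", statusLine)
	}
	if hostPID, err = strconv.Atoi(fields[0]); err != nil {
		return 0, 0, err
	}
	nsPID, err = strconv.Atoi(fields[len(fields)-1])
	return hostPID, nsPID, err
}

func main() {
	// A containerized process: host PID 12345 is PID 500 inside its namespace.
	host, ns, _ := parseNSpid("NSpid:\t12345\t500")
	fmt.Println(host, ns) // 12345 500
}
```

A host-only view sees 12345; the container's own tooling sees 500. Without bridging the two, an alert cannot be tied back to a workload.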

User-space interception techniques like LD_PRELOAD shims or ptrace are highly intrusive, incur significant performance penalties, and are easily bypassed by statically linked binaries or sophisticated malware. Kernel modules (LKMs) offer the necessary visibility but introduce immense risk: a bug in a kernel module can crash the entire system (kernel panic), and maintaining a module across different kernel versions is a nightmare.

This is the problem space where eBPF (extended Berkeley Packet Filter) has emerged as the definitive technology. It allows us to run sandboxed, event-driven programs within the Linux kernel itself. For runtime security, this is a paradigm shift. We can attach eBPF programs to tracepoints, kprobes, and other kernel hooks to observe system behavior—like syscalls, network activity, and file access—at the source, with minimal overhead and without modifying kernel source code or loading unstable modules. The in-kernel eBPF verifier ensures that programs are safe to run, preventing infinite loops, out-of-bounds memory access, and other dangerous operations.

This article is not an introduction to eBPF. It assumes you understand the basic architecture: kernel-space eBPF programs written in restricted C and a user-space controller application that loads and interacts with them. We will focus on building a production-viable security monitor that addresses a common container escape or privilege escalation pattern: a compromised process attempting to access sensitive host files.


Architecture of our eBPF Security Monitor

Our goal is to detect a specific, malicious chain of events: a process running within a container (e.g., a web server like nginx) that unexpectedly executes a shell (/bin/sh) and then attempts to open a sensitive host file like /etc/shadow.

Our system will consist of two main components:

  • eBPF Kernel-Space Program (C): A set of eBPF programs attached to relevant syscall tracepoints (sys_enter_execve, sys_enter_openat). These programs capture execution and file-access data.
  • User-Space Controller (Go): A Go application using the cilium/ebpf library to load, attach, and manage the eBPF programs. It reads event data from a high-performance eBPF ring buffer and applies detection logic.

Throughout, we will use CO-RE (Compile Once - Run Everywhere) principles, leveraging BTF (BPF Type Format) to ensure our eBPF program is portable across different kernel versions without recompilation. This is critical for production deployments.

    The Kernel-Space eBPF Program (`monitor.c`)

    Let's start by defining the data structure for events we'll send to user space and the eBPF maps we'll use for communication and state management.

```c
    // monitor.c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    // Define a common event structure for different syscalls
    struct event {
        u32 host_pid;
        u32 host_ppid;
        u64 mnt_ns;
        char comm[16];
        char filename[256];
        u8 event_type; // 1 for exec, 2 for open
    };
    
    // Ring buffer for sending events to user space. This is preferred over perf buffers
    // for high-throughput scenarios due to better performance and no data overwrites.
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Optional: a hash map to track suspicious PIDs if we were doing more complex correlation
    // struct {
    //     __uint(type, BPF_MAP_TYPE_HASH);
    //     __uint(max_entries, 8192);
    //     __type(key, u32);
    //     __type(value, u8); // 1 = suspicious
    // } suspicious_pids SEC(".maps");
    
    // Force emitting struct event into the ELF.
    const struct event *unused __attribute__((unused));
    
    // Helper to get the mount namespace ID for container identification.
    // BPF_CORE_READ emits CO-RE relocations, so this chain of pointer reads
    // is portable across kernel versions that expose BTF. On kernels without
    // BTF, you would instead fall back to explicit bpf_probe_read_kernel()
    // calls for each pointer hop (including the final ns.inum read).
    static __always_inline u64 get_mnt_ns_id(struct task_struct *task) {
        return BPF_CORE_READ(task, nsproxy, mnt_ns, ns.inum);
    }
    
    // Attach to the tracepoint for the execve syscall
    SEC("tracepoint/syscalls/sys_enter_execve")
    int tracepoint__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
        struct event *e;
        u64 id = bpf_get_current_pid_tgid();
        u32 pid = id >> 32;
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    
        // Reserve space on the ring buffer for our event
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (!e) {
            return 0;
        }
    
        // Populate the event structure
        e->host_pid = pid;
        e->host_ppid = BPF_CORE_READ(task, real_parent, tgid);
        e->mnt_ns = get_mnt_ns_id(task);
        e->event_type = 1;
    
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
        
        // Read the filename argument from the syscall context
        const char *filename_ptr = (const char *)ctx->args[0];
        bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
    
        // Submit the event to the ring buffer
        bpf_ringbuf_submit(e, 0);
    
        return 0;
    }
    
    // Attach to the tracepoint for the openat syscall
    SEC("tracepoint/syscalls/sys_enter_openat")
    int tracepoint__sys_enter_openat(struct trace_event_raw_sys_enter* ctx) {
        struct event *e;
        u64 id = bpf_get_current_pid_tgid();
        u32 pid = id >> 32;
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        
        // For this example, we only care about specific files.
        // In a real system, filtering should be more dynamic, perhaps via another map.
        const char *filename_ptr = (const char *)ctx->args[1];
        char filename_buf[256];
        bpf_probe_read_user_str(&filename_buf, sizeof(filename_buf), filename_ptr);
    
        // KERNEL-SIDE FILTERING: This is critical for performance.
        // Do not send every openat event to user space.
        // Note: bpf_strncmp() requires kernel 5.17+; on older kernels,
        // compare byte-by-byte in a bounded loop instead. These are prefix
        // comparisons, so e.g. "/etc/shadow.bak" also matches.
        if (bpf_strncmp(filename_buf, 11, "/etc/shadow") != 0 &&
            bpf_strncmp(filename_buf, 10, "/etc/hosts") != 0 &&
            bpf_strncmp(filename_buf, 11, "/etc/passwd") != 0) {
            return 0;
        }
    
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (!e) {
            return 0;
        }
    
        e->host_pid = pid;
        e->host_ppid = BPF_CORE_READ(task, real_parent, tgid);
        e->mnt_ns = get_mnt_ns_id(task);
        e->event_type = 2;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
        bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
    
        bpf_ringbuf_submit(e, 0);
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";
```

    Advanced Implementation Details

  • vmlinux.h: This header is generated by bpftool and contains kernel type definitions. It's the foundation of CO-RE, allowing our eBPF code to access kernel structs like task_struct in a portable way.
  • BPF_MAP_TYPE_RINGBUF: We chose a ring buffer over a perf buffer. Why? Ring buffers (available since kernel 5.8) are a newer, more performant mechanism: they preserve event order across CPUs, avoid per-CPU buffer overhead, and involve fewer memory copies. They are not lossless — if the buffer is full, bpf_ringbuf_reserve() fails and the event is dropped at the producer — but the consumer never sees torn or partially overwritten records, making them ideal for high-throughput syscall tracing.
  • Kernel-Side Filtering: In the sys_enter_openat probe, we explicitly check if the filename matches our sensitive list before reserving space on the ring buffer. Sending every single openat event to user space would create an overwhelming firehose of data. Effective runtime security relies on aggressive, intelligent filtering at the earliest possible stage—inside the kernel.
  • Container Context (mnt_ns): We capture the mount namespace ID (mnt_ns). In a containerized environment, this is a more reliable identifier for a container's context than the PID. Multiple containers can run processes with the same in-container PID (e.g., PID 1), but their mount namespace IDs will be unique. A production system would correlate this mnt_ns with container metadata from the Docker or containerd API to get the container name, image, and pod information.

    The User-Space Go Controller (`main.go`)

    The Go application is responsible for the lifecycle of the eBPF program.

```go
    // main.go
    package main
    
    import (
    	"bytes"
    	"encoding/binary"
    	"errors"
    	"log"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/ringbuf"
    	"github.com/cilium/ebpf/rlimit"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf monitor.c -- -I./headers
    
    // Event represents the data structure sent from the eBPF program.
    // It must match the C struct exactly.
    type Event struct {
    	HostPid    uint32
    	HostPpid   uint32
    	MntNs      uint64
    	Comm       [16]byte
    	Filename   [256]byte
    	EventType  uint8
    }
    
    func main() {
    	// Subscribe to signals for graceful shutdown.
    	stopper := make(chan os.Signal, 1)
    	signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
    	// Allow the eBPF loader to use locked memory.
    	if err := rlimit.RemoveMemlock(); err != nil {
    		log.Fatal(err)
    	}
    
    	// Load pre-compiled programs and maps into the kernel.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach the execve tracepoint program.
    	execLink, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSysEnterExecve, nil)
    	if err != nil {
    		log.Fatalf("attaching execve tracepoint: %v", err)
    	}
    	defer execLink.Close()
    
    	// Attach the openat tracepoint program.
    	openLink, err := link.Tracepoint("syscalls", "sys_enter_openat", objs.TracepointSysEnterOpenat, nil)
    	if err != nil {
    		log.Fatalf("attaching openat tracepoint: %v", err)
    	}
    	defer openLink.Close()
    
    	log.Println("Successfully loaded and attached eBPF programs. Waiting for events...")
    
    	// Open a reader from the ring buffer map.
    	rd, err := ringbuf.NewReader(objs.Rb)
    	if err != nil {
    		log.Fatalf("opening ringbuf reader: %v", err)
    	}
    	defer rd.Close()
    
    	// Start a goroutine to handle program shutdown.
    	go func() {
    		<-stopper
    		log.Println("Received signal, exiting...")
    		if err := rd.Close(); err != nil {
    			log.Fatalf("closing ringbuf reader: %v", err)
    		}
    	}()
    
    	var event Event
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, ringbuf.ErrClosed) {
    				log.Println("Ring buffer closed.")
    				return
    			}
    			log.Printf("error reading from ring buffer: %s", err)
    			continue
    		}
    
    		// Parse the raw data into our Go struct.
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing ringbuf event: %s", err)
    			continue
    		}
    
    		// Apply detection logic here.
    		processEvent(event)
    	}
    }
    
    // processEvent is where we implement our detection logic and correlation.
    func processEvent(event Event) {
    	comm := string(bytes.TrimRight(event.Comm[:], "\x00"))
    	filename := string(bytes.TrimRight(event.Filename[:], "\x00"))
    
    	switch event.EventType {
    	case 1: // execve
    		log.Printf("[EXEC] PID: %d, PPID: %d, Comm: %s, Filename: %s, MntNS: %d",
    			event.HostPid, event.HostPpid, comm, filename, event.MntNs)
    
    		// ADVANCED LOGIC: Check if a web server process is spawning a shell.
    		if (comm == "nginx" || comm == "apache2") && (filename == "/bin/sh" || filename == "/bin/bash") {
    			log.Printf("*** HIGH SEVERITY ALERT: Web server '%s' (PID: %d) spawned a shell '%s' in MntNS %d ***",
    				comm, event.HostPid, filename, event.MntNs)
    		}
    	case 2: // openat
    		log.Printf("[OPEN] PID: %d, PPID: %d, Comm: %s, Filename: %s, MntNS: %d",
    			event.HostPid, event.HostPpid, comm, filename, event.MntNs)
    		
    		// Here, you would correlate this with previous events.
    		// For example, check if this PID was recently flagged for spawning a shell.
    		log.Printf("*** MEDIUM SEVERITY ALERT: Process '%s' (PID: %d) accessed sensitive file '%s' in MntNS %d ***",
    			comm, event.HostPid, filename, event.MntNs)
    	}
    }
```

    To make this runnable:

  • Prerequisites: You need Go, Clang, and LLVM installed, plus the libbpf development headers (libbpf-dev). Thanks to CO-RE and the generated vmlinux.h, full kernel headers are not required.
  • Generate Headers: Generate vmlinux.h using bpftool: bpftool btf dump file /sys/kernel/btf/vmlinux format c > headers/vmlinux.h.
  • Generate BPF Go bindings: The go:generate directive handles this. Running go generate ./... will create bpf_bpfel_x86.go and bpf_bpfel_x86.o from monitor.c.
  • Run: sudo go run . (loading eBPF programs requires root, or CAP_BPF plus CAP_PERFMON on newer kernels).

    When you then execute commands in a separate terminal (or inside a container), you will see the corresponding events in the Go application's log.


    Edge Cases and Production Hardening

    What we've built is a powerful foundation, but a production system requires handling numerous edge cases.

    1. The PID Namespace Problem

    An eBPF program attached to a host-level tracepoint will always see the host PID. If a process with PID 500 inside a container executes /bin/sh, the kernel sees it as, for example, host PID 12345. The mnt_ns field we captured is the key to attributing that host PID back to a specific container.

    A production agent must maintain a cache mapping mnt_ns values to container IDs, pod names, and other Kubernetes metadata. This is typically done by watching the container runtime's API (e.g., the Docker socket or containerd's gRPC API). When an event with mnt_ns=4026531840 arrives, the user-space agent looks up this ID in its cache and enriches the event with context: "pod": "frontend-abc-123", "container": "nginx", "image": "nginx:1.21".
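The user-space side can obtain the mount namespace ID for any host PID directly from /proc: the /proc/&lt;pid&gt;/ns/mnt symlink's target embeds the same inode number (inum) our eBPF program reports. A sketch, with helper names of our own choosing:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// mntNamespaceID returns the mount namespace inode for a host PID by
// reading the /proc/<pid>/ns/mnt symlink, whose target looks like
// "mnt:[4026531840]". This matches the ns.inum value from the eBPF side.
func mntNamespaceID(pid int) (uint64, error) {
	target, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/mnt", pid))
	if err != nil {
		return 0, err
	}
	return parseNSLink(target)
}

// parseNSLink extracts the inode number from a namespace link target.
func parseNSLink(target string) (uint64, error) {
	start := strings.IndexByte(target, '[')
	end := strings.IndexByte(target, ']')
	if start < 0 || end <= start {
		return 0, fmt.Errorf("unexpected ns link format: %q", target)
	}
	return strconv.ParseUint(target[start+1:end], 10, 64)
}

func main() {
	id, _ := parseNSLink("mnt:[4026531840]")
	fmt.Println(id) // 4026531840
}
```

The agent would call mntNamespaceID for each container's PID 1 (obtained from the runtime API) to build the mnt_ns-to-metadata cache described above.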

    2. High-Volume Event Storms and Kernel-Side Aggregation

    Imagine a process that legitimately opens thousands of files per second. Sending every event to user space is untenable. This is where kernel-side aggregation becomes a critical optimization pattern.

    Instead of a ring buffer, we could use a BPF_MAP_TYPE_HASH to count occurrences of events inside the kernel. For example, we could create a map where the key is a struct {pid, filename} and the value is a counter.

```c
    // A struct for the aggregation map key
    struct file_access_key {
        u32 pid;
        char filename[64];
    };
    
    // The aggregation map
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, struct file_access_key);
        __type(value, u64); // count
    } file_access_counts SEC(".maps");
    
    // A dedicated openat tracepoint program for aggregation
    SEC("tracepoint/syscalls/sys_enter_openat")
    int count_file_access(struct trace_event_raw_sys_enter *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        const char *filename_ptr = (const char *)ctx->args[1];

        struct file_access_key key = {};
        key.pid = pid;
        bpf_probe_read_user_str(&key.filename, sizeof(key.filename), filename_ptr);

        // Increment the counter for this key. The lookup-then-update pattern
        // has a benign race (two CPUs can both miss the lookup, so one
        // initial count may be lost), which is acceptable for rate-style
        // aggregation.
        u64 *count = bpf_map_lookup_elem(&file_access_counts, &key);
        if (count) {
            __sync_fetch_and_add(count, 1);
        } else {
            u64 init_val = 1;
            bpf_map_update_elem(&file_access_counts, &key, &init_val, BPF_ANY);
        }
        return 0;
    }
```

    The user-space application would then periodically iterate over this map (e.g., every 10 seconds), read the aggregated counts, and reset the map. This drastically reduces the data flow from kernel to user space from millions of events per second to a few thousand map entries every 10 seconds.
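The drain-and-reset cycle can be sketched in plain Go, using an in-memory map as a stand-in for the BPF hash map (a real agent would read entries via cilium/ebpf's map iteration and clear them with deletes; the type and function names here are our own):

```go
package main

import "fmt"

// FileAccessKey mirrors the C struct file_access_key.
type FileAccessKey struct {
	PID      uint32
	Filename string
}

// drainCounts snapshots and clears the aggregation map, returning the
// per-key counts accumulated since the last drain. Against a real BPF map,
// each range iteration would be a map lookup and each delete a map delete,
// so counts arriving mid-drain land in the next interval.
func drainCounts(counts map[FileAccessKey]uint64) map[FileAccessKey]uint64 {
	snapshot := make(map[FileAccessKey]uint64, len(counts))
	for k, v := range counts {
		snapshot[k] = v
		delete(counts, k) // reset so the next interval starts from zero
	}
	return snapshot
}

func main() {
	counts := map[FileAccessKey]uint64{
		{PID: 12345, Filename: "/etc/passwd"}: 42,
	}
	snap := drainCounts(counts)
	fmt.Println(snap[FileAccessKey{PID: 12345, Filename: "/etc/passwd"}], len(counts)) // 42 0
}
```

In the agent, this function would run on a ticker (e.g. every 10 seconds), with the snapshot fed into the detection pipeline.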

    3. Working Within the Verifier's Complexity Limits

    The eBPF verifier imposes strict limits on program size (1 million verified instructions since kernel 5.2, for privileged loaders) and complexity (e.g., no unbounded loops). Complex detection logic can easily exceed these limits.

    The advanced pattern to solve this is BPF-to-BPF function calls and tail calls.

  • Function Calls: Since kernel 4.16, you can define helper functions within your eBPF C code and call them from your main program, just like normal C; the verifier verifies each function in turn. (Combining BPF-to-BPF calls with tail calls in the same program requires kernel 5.10 on x86.)
  • Tail Calls (bpf_tail_call): This is a more powerful mechanism that allows one eBPF program to jump to another, effectively creating a state machine. For example, a main execve probe could handle common cases and then use a tail call to jump to a specialized eBPF program if the binary being executed is suspicious (e.g., strace or socat). This allows you to partition complex logic into smaller, verifiable programs.
```c
    // A map that holds references to other eBPF programs
    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 5);
        __type(key, u32);
        __type(value, u32);
    } prog_array SEC(".maps");
    
    // In the main probe...
    SEC("tracepoint/syscalls/sys_enter_execve")
    int main_prog(struct trace_event_raw_sys_enter *ctx) {
        // ... initial checks that decide whether this exec is suspicious ...
        bool is_suspicious = false; // placeholder for the result of those checks

        if (is_suspicious) {
            // Jump to the program at index 0 in the prog_array map.
            // If the tail call succeeds, execution never returns here.
            bpf_tail_call(ctx, &prog_array, 0);
        }
        return 0;
    }

    // A specialized handler for suspicious execs. A tail-call target must
    // share the caller's program type, hence the matching tracepoint section.
    SEC("tracepoint/syscalls/sys_enter_execve")
    int special_handler(struct trace_event_raw_sys_enter *ctx) {
        // ... perform more detailed analysis here ...
        return 0;
    }
```

    In user space, you would load both main_prog and special_handler and then insert the file descriptor of special_handler into the prog_array map at index 0.

    Conclusion: The Future of Cloud-Native Security

    eBPF is not just another tool; it represents a fundamental shift in how we build observability and security systems for the cloud-native era. By moving detection logic from user-space agents into the sandboxed, high-performance environment of the kernel, we can achieve a level of visibility and efficiency that was previously impossible.

    This article demonstrated a practical, albeit simplified, implementation of a runtime security monitor. We focused on production-oriented patterns: using CO-RE for portability, choosing the right map types for performance (ringbuf), performing aggressive kernel-side filtering, and understanding how to add container context. We also explored advanced techniques like kernel-side aggregation and tail calls to handle the scale and complexity of real-world production environments. Building robust security tooling requires mastering these advanced eBPF patterns, as they are the key to creating systems that are not only effective but also efficient and maintainable at scale.
