eBPF-based Runtime Security Monitoring for Kubernetes Pods

Goh Ling Yong

The Blind Spot: Runtime Security in Ephemeral Kubernetes Environments

As senior engineers, we've architected robust CI/CD pipelines, mastered infrastructure-as-code, and deployed complex microservices on Kubernetes. Yet, a persistent blind spot remains: runtime security. The dynamic, ephemeral, and multi-tenant nature of Kubernetes shatters traditional security paradigms. Static container scanning is essential but insufficient; it tells you nothing about what your code actually does once it's running. Network policies are crucial for segmentation but are blind to host-level process and file system activity.

Attempts to fill this gap often introduce significant trade-offs:

* Sidecar-based Agents: Deploying a security agent in a sidecar container for every application pod introduces non-trivial resource overhead (CPU, memory) and can increase network latency. It also lives in userspace, making it susceptible to the same container breakout vulnerabilities it's meant to detect.

* LD_PRELOAD based Interception: This technique hooks into library calls but is notoriously brittle. It can be easily bypassed by statically linked binaries or applications that make direct syscalls. It's an application-level solution for a system-level problem.

* Host-based Intrusion Detection Systems (HIDS): Traditional HIDS agents running on the node often lack Kubernetes context. An alert that "process X accessed file Y" is useless without knowing it originated from pod-abc in namespace-prod, belonging to the billing-service deployment.

What we need is a mechanism that provides deep, kernel-level visibility, is context-aware, safe, and has minimal performance overhead. This is precisely the problem that eBPF (extended Berkeley Packet Filter) solves.

This article is not an introduction to eBPF. It assumes you understand its core concepts: sandboxed, event-driven programs that run in a kernel VM, and the role of the verifier in ensuring safety. Instead, we will focus on architecting and implementing a production-grade, eBPF-based security monitor for Kubernetes from scratch.

Architecting Our eBPF Security Monitor

Our goal is to detect suspicious activity within our pods, such as unexpected file access or process execution. We'll build a system with three core components:

  • The eBPF Programs (Kernel Probes): Small, efficient C programs that attach to specific kernel functions or tracepoints (e.g., syscall entry/exit points) to capture security-relevant events.
  • The Userspace Agent (Collector & Correlator): A privileged Go application deployed as a DaemonSet on each Kubernetes node. This agent is responsible for loading the eBPF programs, collecting events from them, enriching these events with Kubernetes metadata, and forwarding them to a logging or alerting system.
  • eBPF Maps (Kernel-Userspace Bridge): In-kernel data structures used for communication and state management between our eBPF programs and the userspace agent.

    Here is a high-level view of the architecture on a single Kubernetes node:

    mermaid
    graph TD
        subgraph Kubernetes Node
            subgraph Kernel Space
                A[Syscall: openat, execve, etc.] -- Triggers --> B{eBPF Programs}
                B -- Writes Events --> C[eBPF Ring Buffer Map]
                B -- Reads/Writes State --> D[eBPF Hash/LPM Maps]
            end
            subgraph User Space
                E["Userspace Agent (Go DaemonSet)"] -- Loads/Attaches --> B
                E -- Reads Events --> C
                E -- Manages State --> D
                E -- Queries --> F[Kubelet API]
                E -- Queries --> G[Kubernetes API Server]
                E -- Forwards Alerts --> H[SIEM / Logging Backend]
            end
            P1[Pod 1] --> A
            P2[Pod 2] --> A
        end

    Our agent will run with sufficient privileges (CAP_SYS_ADMIN, CAP_BPF) to load eBPF programs. It will use the Kubernetes API to watch for pod lifecycle events on its node, building a local cache that maps container identifiers (like cgroup IDs) to rich metadata (pod name, namespace, labels, etc.). When an eBPF program detects an event, it will push the raw data into a high-performance ring buffer. The userspace agent reads from this buffer, uses the cgroup ID from the event to look up the Kubernetes context in its cache, and then makes a policy decision.

    Part 1: The Kernel Probes for Syscall Tracing

    The core of our detection logic resides in the eBPF programs. We'll use the libbpf C library, which facilitates the modern CO-RE (Compile Once – Run Everywhere) approach. This avoids the painful process of compiling our eBPF code for every specific kernel version our nodes might be running.

    Let's start by tracing the execve syscall to monitor all new processes being executed inside our pods.

    Handling `execve`: The Challenge of Syscall Arguments

    Tracing execve is more complex than it sounds. The syscall signature is int execve(const char *pathname, char *const argv[], char *const envp[]);. When our eBPF program is triggered at the syscall's entry point (sys_enter_execve), the arguments (pathname, argv) are pointers to userspace memory. eBPF programs run in kernel space and cannot directly dereference arbitrary userspace pointers for safety reasons. We must use a specific helper function, bpf_probe_read_user_str(), to safely copy the data.

    Here is our initial eBPF program (bpf/monitor.c):

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    #define MAX_FILENAME_LEN 256
    #define MAX_ARGS 16
    #define MAX_ARG_LEN 128
    
    // Event structure sent to userspace
    struct exec_event {
        u64 cgroup_id;
        u32 pid;
        u32 ppid;
        char filename[MAX_FILENAME_LEN];
        char comm[TASK_COMM_LEN];
        u8 args_count;
        char args[MAX_ARGS][MAX_ARG_LEN];
    };
    
    // Ring buffer for sending events to userspace
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Ensure BTF info is emitted for exec_event so bpf2go's -type flag can find it
    const struct exec_event *unused __attribute__((unused));
    
    SEC("tracepoint/syscalls/sys_enter_execve")
    int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
        u64 id = bpf_get_current_pid_tgid();
        u32 pid = id >> 32;
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        u32 ppid = BPF_CORE_READ(task, real_parent, tgid);
    
        struct exec_event *event;
        event = bpf_ringbuf_reserve(&rb, sizeof(*event), 0);
        if (!event) {
            return 0; // Ring buffer is full, drop event
        }
    
        event->pid = pid;
        event->ppid = ppid;
        event->cgroup_id = bpf_get_current_cgroup_id();
        bpf_get_current_comm(&event->comm, sizeof(event->comm));
    
        // Safely read filename from the userspace pointer
        const char *filename_ptr = (const char *)ctx->args[0];
        bpf_probe_read_user_str(&event->filename, sizeof(event->filename), filename_ptr);
    
        // Safely read argv from userspace
        const char *const *argv_ptr = (const char *const *)ctx->args[1];
        event->args_count = 0;
        #pragma unroll
        for (int i = 0; i < MAX_ARGS; i++) {
            const char *argp = NULL;
            bpf_probe_read_user(&argp, sizeof(argp), &argv_ptr[i]);
            if (!argp) {
                break;
            }
            bpf_probe_read_user_str(&event->args[i], MAX_ARG_LEN, argp);
            event->args_count++;
        }
    
        bpf_ringbuf_submit(event, 0);
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Advanced Concepts in this Code:

  • vmlinux.h and CO-RE: We include vmlinux.h, which is a header file generated from kernel debugging information (BTF). This allows libbpf to understand kernel data structures (task_struct) and perform relocations at load time, making our eBPF object portable across different kernel versions.
  • BPF_CORE_READ: This macro is used for safely accessing fields of kernel structs. It's part of the CO-RE mechanism. For example, BPF_CORE_READ(task, real_parent, tgid) correctly reads the parent process ID regardless of how the task_struct layout changes between kernels.
  • BPF_MAP_TYPE_RINGBUF: We use a ring buffer map, the modern and most performant way to send data from kernel to userspace. It's a multi-producer, single-consumer (MPSC) lock-free buffer that avoids the per-CPU memory overhead and event-reordering problems of the older BPF_MAP_TYPE_PERF_EVENT_ARRAY mechanism (it can still drop events when full, which we address later).
  • bpf_get_current_cgroup_id(): This is the critical piece of the puzzle for Kubernetes context. It retrieves the ID of the cgroup v2 the current process belongs to. Since Kubernetes uses cgroups to isolate containers, this ID is our primary key for mapping a kernel event back to a specific pod and container.
  • #pragma unroll: The eBPF verifier rejects loops it cannot prove will terminate, and older verifiers rejected loops altogether. By using #pragma unroll, we tell the compiler to unroll the loop into a fixed number of instructions that the verifier can analyze and approve on any supported kernel.

    Part 2: The Go Userspace Agent

    Now we need a Go program to load and interact with our eBPF code. We'll use the excellent cilium/ebpf library.

    The agent's responsibilities are:

    • Embed the compiled eBPF object file (built from our C code at build time) into the binary.
    • Load the eBPF object file into the kernel.
    • Attach the eBPF program to the specified tracepoint.
    • Open the ring buffer map and start polling for events.
    • (Crucially) Connect to the Kubernetes API to build a cgroup-to-pod metadata map.
    • Process events: enrich them with metadata and log them.

    Building and Loading the eBPF Program

    First, we need a way to compile our C code into an eBPF object file. We'll use clang and llvm-strip. A Makefile helps automate this:

    makefile
    CLANG ?= clang
    LLVM_STRIP ?= llvm-strip
    BPF_TARGET_ARCH ?= bpf
    
    VMLINUX_H = vmlinux.h
    
    all: monitor.bpf.o
    
    # Generate vmlinux.h from BTF info
    $(VMLINUX_H): 
    	bpftool btf dump file /sys/kernel/btf/vmlinux format c > $(VMLINUX_H)
    
    # Compile C code to eBPF object file
    monitor.bpf.o: bpf/monitor.c $(VMLINUX_H)
    	$(CLANG) \
    		-g -O2 -target $(BPF_TARGET_ARCH) \
    		-I. \
    		-c bpf/monitor.c \
    		-o $@
    	$(LLVM_STRIP) -g $@
    
    clean:
    	rm -f monitor.bpf.o $(VMLINUX_H)

    We can embed the compiled object file directly into our Go binary using go:embed.
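
    If you prefer to manage the object file yourself rather than relying on the bpf2go-generated loader used in main.go below, a minimal sketch of embedding and loading it might look like this (monitor.bpf.o is the artifact produced by the Makefile above; error handling is kept minimal):

    go
    package main
    
    import (
    	"bytes"
    	_ "embed"
    	"log"
    
    	"github.com/cilium/ebpf"
    )
    
    //go:embed monitor.bpf.o
    var bpfObjectFile []byte
    
    // loadCollection parses the embedded ELF object and loads its programs
    // and maps into the kernel.
    func loadCollection() *ebpf.Collection {
    	spec, err := ebpf.LoadCollectionSpecFromReader(bytes.NewReader(bpfObjectFile))
    	if err != nil {
    		log.Fatalf("parsing eBPF object: %v", err)
    	}
    	coll, err := ebpf.NewCollection(spec)
    	if err != nil {
    		log.Fatalf("loading eBPF collection: %v", err)
    	}
    	return coll
    }
    
    func main() {
    	coll := loadCollection()
    	defer coll.Close()
    	log.Printf("loaded %d programs", len(coll.Programs))
    }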

    The Go Agent Code (`main.go`)

    Here's a simplified but functional version of the agent's core logic.

    go
    package main
    
    import (
    	"bytes"
    	"context"
    	"encoding/binary"
    	"errors"
    	"log"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf"
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/ringbuf"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target bpfel -type exec_event bpf ./bpf/monitor.c -- -I./
    
    const (
    	TASK_COMM_LEN   = 16
    	MAX_FILENAME_LEN = 256
    	MAX_ARGS         = 16
    	MAX_ARG_LEN      = 128
    )
    
    func main() {
    	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    	defer stop()
    
    	// Load pre-compiled programs and maps into the kernel.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach the tracepoint program.
    	tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
    	if err != nil {
    		log.Fatalf("attaching tracepoint: %v", err)
    	}
    	defer tp.Close()
    
    	log.Println("eBPF programs loaded and attached. Waiting for events...")
    
    	// Open a ringbuf reader from the eBPF map.
    	rd, err := ringbuf.NewReader(objs.Rb)
    	if err != nil {
    		log.Fatalf("opening ringbuf reader: %s", err)
    	}
    	defer rd.Close()
    
    	// Goroutine to handle closing the reader on interrupt.
    	go func() {
    		<-ctx.Done()
    		rd.Close()
    	}()
    
    	var event bpfExecEvent
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, ringbuf.ErrClosed) {
    				log.Println("Received signal, exiting...")
    				return
    			}
    			log.Printf("error reading from ringbuf: %s", err)
    			continue
    		}
    
    		// Parse the event data.
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing ringbuf event: %s", err)
    			continue
    		}
    
    		// *** This is where enrichment would happen ***
    		// podMeta := metadataCache.Get(event.CgroupId)
    
    		log.Printf("PID: %d, PPID: %d, Cgroup: %d, Comm: %s, Filename: %s",
    			event.Pid, event.Ppid, event.CgroupId, unix.ByteSliceToString(event.Comm[:]), unix.ByteSliceToString(event.Filename[:]))
    	}
    }

    Important Details:

    * bpf2go: This tool from the cilium/ebpf project is fantastic. It takes our C eBPF code, compiles it, and generates Go code that handles loading the object file and provides typed Go structs that match our C structs. This eliminates a ton of boilerplate.

    * link.Tracepoint: This is the modern way to attach eBPF programs, replacing older ioctl-based methods. It returns a link.Link object, and closing this link automatically detaches the program.

    * ringbuf.NewReader: We create a reader for our ring buffer map. The Read() call blocks until a new event is available.

    Part 3: The Missing Piece - Kubernetes Context Enrichment

    Receiving a cgroup_id is great, but it's just a number. To make our security events actionable, we must map this ID to a Pod, Namespace, and Deployment.

    This is a non-trivial engineering problem in itself. The general pattern is:

  • Use the Kubernetes Go Client: Our agent needs an in-cluster rest.Config to talk to the API server.
  • Watch Pods: The agent should WATCH for pod events on the node it's running on. We can get the node name from the NODE_NAME environment variable (injected via the Downward API in the DaemonSet manifest).
  • Build a Local Cache: For each pod, extract its metadata (name, namespace, labels) and status.containerStatuses. The containerID field in the container status (e.g., containerd://) is key.
  • Parse Cgroup Path: The Kubernetes runtime (like containerd) creates cgroup paths that include the pod's UID and container ID. For a pod with UID pod_uid and container ID container_id, the cgroup path might look something like /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<pod_uid>.slice/cri-containerd-<container_id>.scope.
  • Correlate: When a new pod is scheduled, the agent can resolve the container's cgroup directory under /sys/fs/cgroup/... and read its inode number (on cgroup v2, this is the value bpf_get_current_cgroup_id() returns), then store a mapping: cgroup_id -> PodMetadata in a concurrent-safe map.

    When our eBPF event loop receives an event with CgroupId: 12345, it does a quick lookup in this local cache. The log message is transformed from "cgroup 12345 executed /bin/bash" into "ALERT: Shell spawned in pod 'billing-api-7f... ' (namespace: prod, deployment: billing-api)". This is an actionable security signal.

    Here's a conceptual snippet of the enrichment logic:

    go
    // In your agent's main struct
    type Agent struct {
        // ... other fields
        metadataCache *sync.Map // In a real implementation, use a more structured, TTL-based cache
    }
    
    // Goroutine that watches the K8s API
    func (a *Agent) watchPods(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) {
        // Watch only pods scheduled on this node via a field selector.
        // (metav1 is k8s.io/apimachinery/pkg/apis/meta/v1)
        watcher, err := clientset.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            log.Printf("watching pods: %v", err)
            return
        }
    
        for event := range watcher.ResultChan() {
            pod, ok := event.Object.(*v1.Pod)
            if !ok { continue }
    
            switch event.Type {
            case watch.Added, watch.Modified:
                // For each container in the pod, get its cgroup ID and update the cache
                cgroupId := getCgroupIdForPod(pod) // This is a complex helper function!
                a.metadataCache.Store(cgroupId, buildPodMetadata(pod))
            case watch.Deleted:
                cgroupId := getCgroupIdForPod(pod)
                a.metadataCache.Delete(cgroupId)
            }
        }
    }
    
    // In the event processing loop
    func (a *Agent) processEvent(event bpfExecEvent) {
        var meta string
        if v, ok := a.metadataCache.Load(event.CgroupId); ok {
            podMeta := v.(*PodMetadata) // sync.Map stores interface{}; assert back to our cached type
            meta = fmt.Sprintf("Pod: %s, Namespace: %s", podMeta.Name, podMeta.Namespace)
        } else {
            meta = "(unknown context)"
        }
        log.Printf("EXEC EVENT [%s]: %s", meta, unix.ByteSliceToString(event.Filename[:]))
    }
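
    The getCgroupIdForPod helper glossed over above is where the filesystem comes in: on cgroup v2, the value returned by bpf_get_current_cgroup_id() is the inode number of the container's cgroup directory. A minimal sketch of that building block, assuming you have already constructed the container's cgroup path from the pod UID and container ID:

    go
    package main
    
    import (
    	"fmt"
    
    	"golang.org/x/sys/unix"
    )
    
    // getCgroupID returns the cgroup ID of a cgroup v2 directory by reading its
    // inode number, which matches what bpf_get_current_cgroup_id() reports.
    // Constructing cgroupPath itself is runtime- and cgroup-driver-specific.
    func getCgroupID(cgroupPath string) (uint64, error) {
    	var st unix.Stat_t
    	if err := unix.Stat(cgroupPath, &st); err != nil {
    		return 0, fmt.Errorf("stat %s: %w", cgroupPath, err)
    	}
    	return st.Ino, nil
    }
    
    func main() {
    	id, err := getCgroupID("/sys/fs/cgroup/kubepods.slice") // example path
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println("cgroup id:", id)
    }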

    Production Considerations, Edge Cases, and Performance

    Building a toy is one thing; running it in production is another. Here are the critical considerations for a senior engineering team.

    Edge Case: Containerd Shim Processes

    When you kubectl exec into a container, you might not see the execve event from your target process directly. Instead, you might see containerd's shim process (containerd-shim-runc-v2) executing the command. The process you're interested in will be a child of this shim. This means robust detection requires process ancestry tracking. You'd need to trace fork and exec calls and maintain a process tree in an eBPF map (BPF_MAP_TYPE_HASH mapping pid -> parent_pid/metadata) to correctly attribute the final executed binary to the originating pod context.
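
    This ancestry tracking goes beyond what monitor.c above implements. A minimal sketch of the map and fork hook it would need, written as an addition to the same file (so it reuses the same includes), might look like this:

    c
    // Per-process record: parent PID plus the cgroup the fork happened in.
    struct proc_info {
        u32 ppid;
        u64 cgroup_id;
    };
    
    // PID -> proc_info, consulted by userspace (or other probes) to walk ancestry.
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, u32);
        __type(value, struct proc_info);
    } proc_tree SEC(".maps");
    
    SEC("tracepoint/sched/sched_process_fork")
    int tracepoint__sched__sched_process_fork(struct trace_event_raw_sched_process_fork *ctx) {
        u32 child_pid = ctx->child_pid;
        struct proc_info info = {
            .ppid      = ctx->parent_pid,
            // The tracepoint fires in the parent's context, so this records the
            // parent's cgroup; children normally start in the same cgroup.
            .cgroup_id = bpf_get_current_cgroup_id(),
        };
        bpf_map_update_elem(&proc_tree, &child_pid, &info, BPF_ANY);
        return 0;
    }

    With this in place, the userspace agent can walk proc_tree from an observed PID up through its parents until it finds a cgroup_id that maps to a pod, attributing shim-spawned processes to the correct context.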

    Performance: Kernel Filtering is Key

    Sending every single execve event from the kernel to userspace is inefficient and noisy. On a busy system, this could be thousands of events per second. The power of eBPF is the ability to filter in the kernel.

    Example: Let's only send events for processes executed out of suspicious directories like /tmp or /var/tmp.

    We can't do string comparison directly in eBPF easily. A common pattern is to send the event to userspace and let it do the filtering. A more advanced, but highly efficient, pattern is to use a set of eBPF maps.

  • Create a BPF_MAP_TYPE_HASH map where the key is a char[256] (the filename) and the value is a u8 (boolean flag).
  • The userspace agent populates this map with a list of allowed binaries (e.g., /usr/bin/python3, /app/server).
  • The eBPF program performs a bpf_map_lookup_elem() on this map. If the filename is not found in the allowlist, then and only then does it reserve space in the ring buffer and send the event.

    This reduces kernel-userspace traffic by orders of magnitude, dramatically lowering the agent's CPU overhead. A sketch of this filtering map follows below.
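
    The sketch is written as an addition to monitor.c above (reusing its includes and MAX_FILENAME_LEN); the exact key layout and map size are assumptions:

    c
    // Allowlist map: userspace fills it with full paths of known-good binaries.
    // Keys must be zero-padded so lookups compare equal.
    struct exec_path {
        char path[MAX_FILENAME_LEN];
    };
    
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, struct exec_path);
        __type(value, u8);
    } exec_allowlist SEC(".maps");
    
    // Inside sys_enter_execve, before calling bpf_ringbuf_reserve():
    //
    //     struct exec_path key = {};
    //     bpf_probe_read_user_str(key.path, sizeof(key.path), filename_ptr);
    //     if (bpf_map_lookup_elem(&exec_allowlist, &key))
    //         return 0; // allowlisted binary: skip the event entirely

    On the Go side, the agent can seed this map at startup through the generated map object's Put method (e.g., objs.ExecAllowlist.Put), keeping the policy in userspace while the filtering itself happens in the kernel.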

    High Event Volume and Ring Buffer Overflows

    Even with filtering, a security incident (like a fork bomb) could generate a massive burst of events, potentially overflowing the ring buffer. The bpf_ringbuf_reserve call will start returning NULL, and we will lose events.

    This is a fundamental trade-off in observability. While you can increase the ring buffer size, it consumes non-swappable kernel memory. The better strategy includes:

    * In-kernel Aggregation: For some event types (e.g., network connections), you can aggregate stats in an eBPF map. For instance, map (source_ip, dest_ip) -> count and only send an update to userspace periodically or when a threshold is breached.

    * Userspace Backpressure: The userspace agent should be designed to consume events as fast as possible. If its downstream sink (e.g., a SIEM) is slow, the agent should buffer events in userspace memory or even drop less critical events to keep up with the kernel (see the sketch after this list).

    * Monitoring: The agent must monitor for dropped events. The cilium/ebpf/ringbuf reader doesn't directly expose this, but in a perf_event_array setup, you can read a counter of lost samples. For ring buffers, you'd need to implement a health check in your agent to see if it's falling behind.
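
    A minimal sketch of that backpressure pattern, decoupling the ring buffer read loop from a slow sink with a bounded channel; the event type here is a stand-in for the bpf2go-generated bpfExecEvent, and the channel size and drop policy are illustrative:

    go
    package main
    
    import (
    	"log"
    	"sync/atomic"
    )
    
    // event stands in for the bpf2go-generated bpfExecEvent type.
    type event struct{ Pid uint32 }
    
    var droppedEvents atomic.Uint64
    
    // enqueue hands an event to the sink without ever blocking the kernel-facing
    // ring buffer reader; overflow is counted rather than stalling the reader.
    func enqueue(ch chan<- event, e event) {
    	select {
    	case ch <- e:
    	default:
    		droppedEvents.Add(1) // sink too slow: shed load and record it
    	}
    }
    
    func main() {
    	sink := make(chan event, 8192) // bounded buffer between reader and forwarder
    	go func() {
    		for range sink {
    			// forward to the SIEM / logging backend here
    		}
    	}()
    	for i := 0; i < 100000; i++ {
    		enqueue(sink, event{Pid: uint32(i)})
    	}
    	log.Printf("dropped %d events due to backpressure", droppedEvents.Load())
    }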

    Deploying to Kubernetes

    Our agent must be deployed as a DaemonSet to ensure it runs on every node. The manifest requires specific security contexts and volume mounts to access kernel resources.

    yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ebpf-security-monitor
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: ebpf-security-monitor
      template:
        metadata:
          labels:
            name: ebpf-security-monitor
        spec:
          tolerations:
          - operator: Exists
          hostPID: true # Required for process ancestry tracking
          containers:
          - name: monitor-agent
            image: your-registry/ebpf-monitor-agent:latest
            securityContext:
              privileged: true # Simplest way, but can be locked down
              # For a non-privileged setup, you need:
              # capabilities:
              #   add:
              #   - SYS_ADMIN
              #   - BPF
            env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            volumeMounts:
            - name: bpf-fs
              mountPath: /sys/fs/bpf
              mountPropagation: HostToContainer
            - name: debug-fs
              mountPath: /sys/kernel/debug
          volumes:
          - name: bpf-fs
            hostPath:
              path: /sys/fs/bpf
          - name: debug-fs
            hostPath:
              path: /sys/kernel/debug

    Conclusion: The Future is Kernel-Level

    We have journeyed from the high-level problem of runtime security in Kubernetes down to the intricate details of kernel-level syscall tracing, memory management, and context enrichment. By leveraging eBPF with a CO-RE approach, we've built the foundation of a security tool that is highly performant, difficult to evade, and deeply context-aware.

    This approach is not a silver bullet, but it represents a paradigm shift. It moves security monitoring from brittle application-level shims or resource-intensive sidecars to a centralized, efficient, and programmable layer within the kernel itself. This is the same technology that powers modern cloud-native networking (Cilium), service meshes (Istio's eBPF datapath), and observability (Pixie, Parca).

    For senior engineers tasked with securing complex, large-scale systems, mastering eBPF is no longer optional. It is the key to unlocking the next generation of secure, observable, and high-performance cloud-native infrastructure.
