eBPF for K8s Runtime Security: Syscall Auditing at Production Scale

Goh Ling Yong

The Observability Gap in Containerized Environments

In a production Kubernetes cluster, the fundamental unit of execution is the container, an isolated process running in its own set of namespaces. While this provides process isolation, it creates a significant observability gap for traditional security tooling. Host-based intrusion detection systems (HIDS) operating on the node level, such as auditd, were not designed for this paradigm. They suffer from three critical flaws in a containerized world:

  • Performance Overhead: auditd operates by writing synchronous records to disk for every matching rule, creating significant I/O and CPU pressure. Enabling comprehensive syscall auditing on a busy Kubernetes node with hundreds of pods can degrade application performance by 20-40% or more, an unacceptable cost.
  • Lack of Context: An auditd event reports that PID 27182 executed /bin/sh. This is useless without context. Which container does this PID belong to? Which Pod? Which ReplicaSet, Deployment, and Namespace? Answering these questions requires complex, slow, and often racy user-space correlation logic.
  • Brittleness: Tools that rely on LD_PRELOAD or ptrace are easily bypassed. A statically compiled binary or a sophisticated attacker can simply avoid the user-space libraries being hooked, rendering the monitoring blind.

This is where eBPF (extended Berkeley Packet Filter) provides a transformative solution. By executing sandboxed, JIT-compiled programs directly within the kernel, we can intercept syscalls at their source, filter uninteresting events with negligible overhead, and ship only high-signal data to a user-space agent for enrichment and analysis. This post walks through building a production-grade syscall auditing agent using eBPF, focusing on the advanced patterns required to make it efficient and scalable in a real-world Kubernetes environment.


    Core Architecture: Kernel Probes and User-Space Correlation

    Our system will consist of two primary components, deployed as a Kubernetes DaemonSet to run on every node:

  • eBPF Kernel Program (C): A small, efficient C program that attaches to kernel tracepoints or kprobes. Its sole responsibilities are to capture specific syscall events, populate a data structure with relevant information (PID, UID, command, etc.), and push this data into a shared memory buffer (a perf buffer or ring buffer).
  • User-Space Agent (Go): A Go application that loads and manages the eBPF program. It reads the raw event data from the kernel, enriches it with Kubernetes metadata by querying the container runtime and the K8s API server, and forwards the final, context-rich event to a logging or security information and event management (SIEM) system.

    We will focus on monitoring the execve syscall, which is a high-signal indicator of process execution and a critical event for security monitoring.

    The eBPF Kernel Program: Capturing `execve`

    We'll use a tracepoint for sys_enter_execve as it's a stable API. The program will capture the process ID (PID), user ID (UID), group ID (GID), and the filename being executed.

    Here is the complete C program (bpf_program.c). We are using modern libbpf and CO-RE (Compile Once - Run Everywhere) principles.

    c
    // bpf_program.c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h> 
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    // Event structure sent to user-space
    struct event {
        u32 pid;
        u32 uid;
        u32 gid;
        char comm[16]; // TASK_COMM_LEN
        char filename[256];
    };
    
    // BPF perf buffer map to send events to user-space
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(u32));
        __uint(value_size, sizeof(u32));
    } events SEC(".maps");
    
    SEC("tracepoint/syscalls/sys_enter_execve")
    int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
        struct event event = {};
        u64 id;
    
        // Get PID, UID, GID
        id = bpf_get_current_pid_tgid();
        event.pid = id >> 32;
        id = bpf_get_current_uid_gid();
        event.uid = (u32)id;
        event.gid = id >> 32;
    
        // Get process command name
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
    
        // Get the filename being executed
        // The filename is the first argument of the syscall context.
        const char *filename_ptr = (const char *) BPF_CORE_READ(ctx, args[0]);
        bpf_probe_read_user_str(&event.filename, sizeof(event.filename), filename_ptr);
    
        // Submit the event to the perf buffer
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Implementation Details:

    * vmlinux.h: This header is generated by bpftool (e.g., bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h) and provides the kernel type definitions that enable CO-RE. This avoids the need for system-wide kernel headers and makes our program portable across different kernel versions.

    * BPF_CORE_READ: This helper macro provides a safe way to read kernel memory structures, gracefully handling changes in struct layouts between kernel versions.

    * bpf_probe_read_user_str: This is a critical helper for safely copying a string from user-space memory (where the syscall arguments live) into our eBPF program's stack. It's essential for security and stability.

    * bpf_perf_event_output: This function pushes our populated event struct into the events perf buffer map, making it available for our user-space agent to read.

    The User-Space Agent: Loading and Listening

    Now, let's create the Go agent that will load this eBPF program and process its events. We'll use the excellent cilium/ebpf library.

    go
    // main.go
    package main
    
    import (
    	"bytes"
    	"encoding/binary"
    	"errors"
    	"log"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/perf"
    	"golang.org/x/sys/unix"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang bpf bpf_program.c -- -I./headers
    
    // Event structure must match the C struct exactly
    type Event struct {
    	Pid      uint32
    	Uid      uint32
    	Gid      uint32
    	Comm     [16]byte
    	Filename [256]byte
    }
    
    func main() {
    	stopper := make(chan os.Signal, 1)
    	signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
    	// Allow the current process to lock memory for eBPF resources
    	// (required on kernels older than 5.11).
    	if err := rlimit.RemoveMemlock(); err != nil {
    		log.Fatalf("removing memlock limit: %v", err)
    	}

    	// Load pre-compiled BPF objects from the ELF file.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach the tracepoint
    	tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
    	if err != nil {
    		log.Fatalf("attaching tracepoint: %s", err)
    	}
    	defer tp.Close()
    
    	log.Println("Waiting for events...")
    
    	// Open a reader on the perf event map. The second argument is the
    	// per-CPU buffer size; one page is fine for a demo, busy nodes need more.
    	rd, err := perf.NewReader(objs.Events, os.Getpagesize())
    	if err != nil {
    		log.Fatalf("creating perf event reader: %s", err)
    	}
    	defer rd.Close()
    
    	go func() {
    		<-stopper
    		log.Println("Received signal, exiting...")
    		rd.Close()
    	}()
    
    	var event Event
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, perf.ErrClosed) {
    				return
    			}
    			log.Printf("reading from perf buffer: %s", err)
    			continue
    		}
    
    		if record.LostSamples > 0 {
    			log.Printf("perf buffer overflow: lost %d samples", record.LostSamples)
    			continue
    		}
    
    		// Parse the raw perf event data into our Go struct
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing perf event: %s", err)
    			continue
    		}
    
    		log.Printf("PID: %d, UID: %d, Command: %s, Filename: %s",
    			event.Pid,
    			event.Uid,
    			unix.ByteSliceToString(event.Comm[:]),
    			unix.ByteSliceToString(event.Filename[:]),
    		)
    	}
    }

    To build this, you need clang, llvm, and Go; thanks to CO-RE, system-wide kernel headers are not required, only the generated vmlinux.h. The //go:generate command uses bpf2go to compile the C code and embed it into a generated Go file (e.g., bpf_bpfel_x86.go, depending on the target), so the deployment artifact is a single self-contained binary.

    This basic setup works, but running it on a production Kubernetes node will immediately reveal its flaws: it generates a massive amount of noise and lacks the container context that makes the data actionable.


    Production Pattern 1: In-Kernel Filtering with BPF Maps

    The firehose of execve events from system processes, kubelet health checks, and benign application activity is overwhelming. Sending every event to user-space for filtering is wildly inefficient. The key to performance is to filter as early as possible—inside the kernel.

    Let's modify our eBPF program to ignore events from a specific set of PIDs. We'll use a BPF_MAP_TYPE_HASH map, which our user-space agent can dynamically populate with PIDs to ignore.

    Updated eBPF Program with Filtering

    c
    // bpf_program.c (updated)
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    // The event struct and events map are unchanged from above; shown again for completeness.
    
    struct event {
        u32 pid;
        u32 uid;
        u32 gid;
        char comm[16];
        char filename[256];
    };
    
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(u32));
        __uint(value_size, sizeof(u32));
    } events SEC(".maps");
    
    // New map for filtering PIDs
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, u32);
        __type(value, u8);
    } filter_pids SEC(".maps");
    
    SEC("tracepoint/syscalls/sys_enter_execve")
    int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
        u64 id = bpf_get_current_pid_tgid();
        u32 pid = id >> 32;
        u8 *is_filtered;
    
        // Check if the PID is in our filter map
        is_filtered = bpf_map_lookup_elem(&filter_pids, &pid);
        if (is_filtered != NULL) {
            // PID is in the map, so we ignore this event
            return 0;
        }
    
        struct event event = {};
        event.pid = pid;
        // ... (rest of the event population code remains the same)
        // ... (bpf_perf_event_output call remains the same)
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Updating the Filter Map from Go

    Our user-space agent can now add PIDs to this filter_pids map to dynamically silence noisy processes.

    go
    // In main.go
    
    // ... after loading bpf objects
    
    // Example: filter out our own agent's PID
    selfPid := uint32(os.Getpid())
    var value uint8 = 1
    if err := objs.FilterPids.Put(selfPid, value); err != nil {
        log.Fatalf("Failed to update filter_pids map: %v", err)
    }
    log.Printf("Filtering events from our own PID: %d", selfPid)
    
    // ... rest of the main function

    This pattern is powerful and general: you can extend it to filter by UID (e.g., ignore root processes), by command name, or by any other data available in the kernel context. Depending on the workload, in-kernel filtering like this can cut the volume of events sent to user-space by an order of magnitude, dramatically lowering the agent's CPU footprint.
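
    As an illustration, here is a minimal sketch of a UID-based filter that could sit alongside filter_pids in bpf_program.c. The filter_uids map and the uid_is_filtered helper are illustrative names, not part of the program above; the agent would populate filter_uids from Go exactly as it does filter_pids.

    c
    // Hypothetical UID filter, analogous to filter_pids above.
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, u32);   // UID to ignore
        __type(value, u8);
    } filter_uids SEC(".maps");

    // Call at the top of the tracepoint handler and return early when non-zero.
    static __always_inline int uid_is_filtered(void) {
        u32 uid = (u32)bpf_get_current_uid_gid();
        return bpf_map_lookup_elem(&filter_uids, &uid) != NULL;
    }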

    Production Pattern 2: Container Context Enrichment

    An event like {PID: 31234, Filename: "/bin/bash"} is still not actionable. We need to know this is happening inside Pod: 'billing-api-7b8c4f9b8d-z9x8w' in the prod namespace.

    This enrichment must happen in the user-space agent. It's a multi-step correlation process:

  • PID to Container ID: The agent needs to map the host PID (which eBPF sees) to a container ID. The most reliable way is to read the /proc/<PID>/cgroup file on the host. The cgroup path for containerized processes typically contains the container ID.
  • Container ID to Kubernetes Metadata: Once we have the container ID, the agent queries the local container runtime (e.g., containerd, CRI-O) to get container labels. These labels, set by the kubelet, include the pod name, namespace, container name, etc.
  • (Optional) Kubernetes API Server Query: For even richer context (Deployment, ReplicaSet, labels, annotations), the agent can use the pod and namespace information to query the Kubernetes API server.

    Here's a conceptual implementation of the enrichment logic in Go:

    go
    // A simplified enrichment service
    
    import (
        "context"
        "errors"
        "log"
        "sync"

        // You would use the official k8s client-go library
        // and a library for your container runtime (e.g., containerd client)
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"

        "golang.org/x/sys/unix"
    )
    
    type EnrichedEvent struct {
        OriginalEvent Event
        PodName       string
        Namespace     string
        ContainerID   string
        // ... more k8s metadata
    }
    
    // Caching is CRITICAL to avoid overwhelming the K8s API server
    type MetadataCache struct {
        mu      sync.RWMutex
        podInfo map[string]*v1.Pod // key: containerID
    }
    
    func (c *MetadataCache) Get(containerID string) (*v1.Pod, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        pod, found := c.podInfo[containerID]
        return pod, found
    }
    
    // This function would be run in a background goroutine to keep the cache warm
    func (c *MetadataCache) warmCache(clientset *kubernetes.Clientset) {
        // ... list all pods on the node, iterate through their container statuses,
        // and populate the map with containerID -> pod object.
    }
    
    // The main enrichment function
    func enrichEvent(event Event, cache *MetadataCache, runtimeClient /*...*/) (*EnrichedEvent, error) {
        // Step 1: Get Container ID from PID (pseudo-code)
        containerID, err := getContainerIDFromPID(event.Pid)
        if err != nil {
            // Not in a container, or process already exited
            return nil, err
        }
    
        // Step 2: Use cache to get K8s metadata
        pod, found := cache.Get(containerID)
        if !found {
            // Cache miss - could be a new pod. Perform a direct lookup as a fallback.
            // In a real system, you'd have a robust cache update mechanism.
            return nil, errors.New("pod not found in cache")
        }
    
        enriched := &EnrichedEvent{
            OriginalEvent: event,
            PodName:       pod.Name,
            Namespace:     pod.Namespace,
            ContainerID:   containerID,
        }
    
        log.Printf("Enriched Event: Namespace=%s, Pod=%s, Container=%s executed %s",
            enriched.Namespace,
            enriched.PodName,
            enriched.ContainerID,
            unix.ByteSliceToString(enriched.OriginalEvent.Filename[:]),
        )
    
        return enriched, nil
    }

    Edge Case Handling:

    * Process Exits: A process might execute and exit before the user-space agent can read its /proc entry. This is a classic race condition. There is no perfect solution, but high-performance eBPF agents using ring buffers can shrink the window to milliseconds, and for critical events you can pair execve monitoring with process-exit tracing to manage the lifecycle (see the sketch after this list).

    * Cache Staleness: Pods are deleted and created constantly. The metadata cache must be actively maintained, likely by watching the Kubernetes API for pod events on the current node.
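
    To make the lifecycle idea concrete, here is a minimal kernel-side sketch that emits a companion event when a process exits. It hooks the sched_process_exit tracepoint rather than the raw exit/exit_group syscalls, and the exit_event struct is illustrative; it also reuses the events perf buffer from bpf_program.c, so a real agent would tag event types (or use a dedicated map) to tell the two apart.

    c
    // Sketch: pair execve events with process-exit events so the agent
    // can promptly retire cached per-PID state.
    struct exit_event {
        u32 pid;
    };

    SEC("tracepoint/sched/sched_process_exit")
    int tracepoint__sched__sched_process_exit(void *ctx) {
        u64 id = bpf_get_current_pid_tgid();

        // The tracepoint fires once per task; only report when the
        // thread group leader (pid == tgid) exits.
        if ((u32)id != (u32)(id >> 32))
            return 0;

        struct exit_event e = { .pid = (u32)(id >> 32) };
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &e, sizeof(e));
        return 0;
    }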

    Production Pattern 3: High-Volume Syscalls and Ring Buffers

    Monitoring execve is one thing; monitoring openat, connect, or read/write is another. These can fire thousands of times per second per CPU. Using bpf_perf_event_output for such high-frequency events is a recipe for LostSamples errors, as the per-CPU buffers can't be consumed fast enough.

    The modern solution is BPF_MAP_TYPE_RINGBUF. It's a multi-producer, single-consumer (MPSC) lock-free ring buffer that offers significantly higher throughput and efficiency.

    Switching from a perf buffer to a ring buffer involves a few changes:

  • eBPF Map Definition (C):
        c
        struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024); // Size in bytes: a page-aligned power of two
        } events SEC(".maps");
  • Submitting an Event (C):
        c
        // Reserve space on the ring buffer
        struct event *e = bpf_ringbuf_reserve(&events, sizeof(struct event), 0);
        if (!e) {
            return 0; // Not enough space, drop event
        }
        
        // Populate the reserved space (e.g., e->pid = pid)
        // ...
        
        // Commit the event to the buffer
        bpf_ringbuf_submit(e, 0);
  • Reading from the Ring Buffer (Go):
        go
        // In main.go, import "github.com/cilium/ebpf/ringbuf" and replace perf.NewReader with:
        rd, err := ringbuf.NewReader(objs.Events)
        if err != nil {
            log.Fatalf("creating ringbuf reader: %s", err)
        }
        defer rd.Close()
    
        // The read loop is almost identical
        record, err := rd.Read()
        // ...

    For extremely high-volume syscalls, even a ring buffer isn't enough. The next level of optimization is in-kernel aggregation. For example, instead of reporting every connect syscall, you could use a BPF hash map to count connections per PID to a destination IP/port pair, and only send an event to user-space once a threshold is breached. This transforms eBPF from a simple event firehose into a distributed, in-kernel threat detection engine.
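
    A minimal sketch of that aggregation pattern follows, counting connect() attempts per PID and destination and emitting an alert only once a threshold is crossed. Everything here is illustrative rather than definitive: sockaddr_in_min is a hand-rolled view of the user-space sockaddr (so the sketch does not depend on the full kernel definition), CONN_THRESHOLD is arbitrary, and the alert goes out through a ring buffer map like the one introduced above.

    c
    // conn_aggregate.c -- in-kernel aggregation sketch
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    #define CONN_THRESHOLD 100 // arbitrary; tune per workload

    struct sockaddr_in_min {   // minimal IPv4 sockaddr view
        u16 family;            // 2 == AF_INET
        u16 port;              // network byte order
        u32 addr;              // network byte order
    };

    struct conn_key {
        u32 pid;
        u32 daddr;
        u16 dport;
    };

    struct conn_alert {
        struct conn_key key;
        u64 count;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 16384);
        __type(key, struct conn_key);
        __type(value, u64);
    } conn_counts SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } events SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_connect")
    int tracepoint__syscalls__sys_enter_connect(struct trace_event_raw_sys_enter *ctx) {
        struct sockaddr_in_min sa = {};
        const void *uaddr = (const void *)BPF_CORE_READ(ctx, args[1]);

        if (bpf_probe_read_user(&sa, sizeof(sa), uaddr) != 0 || sa.family != 2)
            return 0; // unreadable, or not IPv4

        // Zero the key explicitly so struct padding never poisons map lookups.
        struct conn_key key;
        __builtin_memset(&key, 0, sizeof(key));
        key.pid   = (u32)(bpf_get_current_pid_tgid() >> 32);
        key.daddr = sa.addr;
        key.dport = sa.port;

        u64 one = 1;
        u64 *count = bpf_map_lookup_elem(&conn_counts, &key);
        if (!count) {
            bpf_map_update_elem(&conn_counts, &key, &one, BPF_ANY);
            return 0;
        }
        __sync_fetch_and_add(count, 1);

        // Cross the kernel/user boundary only when the threshold is reached.
        if (*count == CONN_THRESHOLD) {
            struct conn_alert *alert = bpf_ringbuf_reserve(&events, sizeof(*alert), 0);
            if (!alert)
                return 0; // ring buffer full; drop the alert
            __builtin_memset(alert, 0, sizeof(*alert));
            alert->key = key;
            alert->count = *count;
            bpf_ringbuf_submit(alert, 0);
        }
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";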

    Deployment in Kubernetes

    To deploy our agent, we use a DaemonSet to ensure it runs on every node. The manifest requires elevated privileges to interact with the kernel's eBPF subsystem.

    yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ebpf-syscall-monitor
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: ebpf-syscall-monitor
      template:
        metadata:
          labels:
            name: ebpf-syscall-monitor
        spec:
          tolerations:
          - operator: Exists
          hostPID: true # Required to inspect /proc of other processes
          containers:
          - name: monitor-agent
            image: your-repo/ebpf-agent:latest
            securityContext:
              privileged: true # Simplest way, but can be locked down
              # Finer-grained capabilities:
              # capabilities:
              #   add:
              #   - SYS_ADMIN
              #   - BPF
              #   - PERFMON
            volumeMounts:
            - name: bpf-fs
              mountPath: /sys/fs/bpf
            - name: proc-fs
              mountPath: /host/proc
              readOnly: true
          volumes:
          - name: bpf-fs
            hostPath:
              path: /sys/fs/bpf
          - name: proc-fs
            hostPath:
              path: /proc

    Security Considerations:

    * privileged: true is powerful and dangerous. A production system should drop it in favor of specific capabilities: CAP_BPF and CAP_PERFMON on kernels that support them (5.8+), with CAP_SYS_ADMIN only where older kernels still require it.

    * The agent's ServiceAccount needs RBAC permissions to read Pod information from the API server.

    Conclusion: Kernel-Native Security is the Future

    eBPF is not just another tool; it's a fundamental shift in how we build observability and security systems for cloud-native infrastructure. By moving detection logic from slow, context-poor user-space agents into the performant, all-seeing kernel, we can build systems that are both highly effective and minimally invasive.

    We have demonstrated the path from a simple proof-of-concept to a production-ready architecture by implementing critical patterns:

    * In-kernel filtering to drastically reduce data volume.

    * User-space enrichment to transform raw kernel data into actionable security intelligence.

    * Advanced eBPF features like ring buffers to handle high-frequency events at scale.

    The examples here are just the beginning. The same principles can be applied to monitor file integrity (openat, unlink), network connections (connect, accept), and virtually any other kernel activity, providing a comprehensive and tamper-resistant view of what's truly happening inside your Kubernetes cluster.
