eBPF for Real-Time K8s Pod Threat Detection via Syscall Hooks

Goh Ling Yong

The Observability Gap: Why Sidecars Fall Short for Runtime Security

In modern Kubernetes environments, runtime security is a non-negotiable requirement. The standard approach has often involved injecting a security agent as a sidecar container into each pod or deploying a privileged agent on each node. Both models present significant trade-offs that senior engineers must contend with.

Sidecars, while providing application-level context, introduce resource overhead (CPU/memory per pod), increase latency for network-based policies, and complicate the deployment lifecycle. Node-level agents, while more efficient, often rely on higher-level container runtime interfaces or periodic process table scans, creating a temporal gap where a malicious, short-lived process can execute and terminate between scans, completely evading detection.

Both approaches struggle with the kernel-userspace boundary. Intrusive methods like ptrace impose a heavy performance penalty, while less intrusive methods lack the fidelity to capture every critical system call. This is the core problem: we need a mechanism that offers comprehensive, kernel-level visibility with minimal performance overhead, tailored for the ephemeral and dynamic nature of Kubernetes pods. This is precisely the niche where eBPF (extended Berkeley Packet Filter) excels.

This article assumes you understand the fundamentals of eBPF—what it is, the role of the verifier, and the concept of maps. We will dive directly into building a sophisticated pod security monitoring tool that hooks directly into system calls to detect threats in real-time.


Core Implementation: Hooking `execve` with a `kprobe`

Our first objective is to detect every new process execution within any pod on a node. The canonical system call for this is execve. We'll use a kprobe (kernel probe) to dynamically attach our eBPF program to the entry point of the kernel function that handles this syscall.

1. The eBPF Kernel-Space Program (C)

This C code is not meant to be compiled with a standard C compiler like GCC. It's compiled using Clang/LLVM into eBPF bytecode. We'll use libbpf conventions and helpers.

pod_monitor.bpf.c

c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

// Define a common structure for events to be sent to user space.
// This ensures a consistent data format.
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 256

struct event {
    __u32 pid;
    __u32 ppid;
    char comm[TASK_COMM_LEN];
    char filename[MAX_FILENAME_LEN];
};

// Ring buffer for sending events to user space.
// Ringbuf is generally preferred over per-CPU perf buffers for high-throughput
// scenarios: it preserves event ordering, is more memory efficient, and makes drops explicit.
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");

// This is the kprobe attached to the entry of the execve syscall.
// The function name format `__x64_sys_execve` is specific to the x86_64 architecture.
// Using tracepoints like `syscalls/sys_enter_execve` is often more stable across kernel versions,
// but we'll start with kprobe for demonstration.
SEC("kprobe/__x64_sys_execve")
int BPF_KPROBE(handle_execve, const struct pt_regs *regs)
{
    struct event *e;
    struct task_struct *task;
    __u64 id;
    __u32 pid, ppid;
    const char *filename_ptr;

    // Get PID and Parent PID
    id = bpf_get_current_pid_tgid();
    pid = id >> 32;

    // Reserve space on the ring buffer for our event
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) {
        return 0; // Not enough space, drop event
    }

    // Get task_struct for more info
    task = (struct task_struct *)bpf_get_current_task();
    ppid = BPF_CORE_READ(task, real_parent, tgid);

    // Populate our event structure
    e->pid = pid;
    e->ppid = ppid;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    // The x86_64 syscall wrapper receives the user-space register set as its only
    // argument, so the filename pointer sits in the first parameter slot of that
    // inner pt_regs. Reading it requires a CO-RE probe read; the string itself is
    // then copied from user-space memory with a dedicated helper.
    filename_ptr = (const char *)PT_REGS_PARM1_CORE(regs);
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);

    // Submit the event to user space
    bpf_ringbuf_submit(e, 0);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Key Implementation Details:

* vmlinux.h: This header is generated by bpftool and contains kernel type definitions. It's essential for CO-RE (Compile Once - Run Everywhere), allowing our program to work across different kernel versions without recompilation.

* BPF_MAP_TYPE_RINGBUF: We're using a ring buffer map. Unlike per-CPU perf buffers, the ring buffer is a single multi-producer, single-consumer lockless ring shared by all CPUs. It preserves event ordering, uses memory more efficiently, and its reserve/submit API makes drops explicit when the buffer is full, which suits high-volume syscall monitoring.

* BPF_CORE_READ: This macro is a CO-RE helper that safely reads kernel struct fields. If the kernel struct layout changes in a future version, the eBPF loader can perform runtime relocations to ensure the correct field is accessed.

* bpf_probe_read_user_str: Syscall arguments like the filename point into user-space memory, which an eBPF program cannot dereference directly. This helper safely copies the string into memory the program owns (here, the slot reserved on the ring buffer).

2. The User-Space Controller (Go)

This Go program is responsible for loading the eBPF bytecode into the kernel, attaching the probe, and listening for events from the ring buffer.

We'll use the excellent cilium/ebpf library.

main.go

go
package main

import (
    "bytes"
    "encoding/binary"
    "log"
    "os"
    "os/signal"
    "syscall"

    "github.com/cilium/ebpf/link"
    "github.com/cilium/ebpf/ringbuf"
    "golang.org/x/sys/unix"
)

//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf pod_monitor.bpf.c -- -I./headers

const (
    taskCommLen    = 16
    maxFilenameLen = 256
)

type event struct {
    Pid      uint32
    Ppid     uint32
    Comm     [taskCommLen]byte
    Filename [maxFilenameLen]byte
}

func main() {
    stopper := make(chan os.Signal, 1)
    signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)

    // Allow the current process to lock memory for eBPF maps.
    if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}); err != nil {
        log.Fatalf("failed to set rlimit: %v", err)
    }

    // Load pre-compiled programs and maps into the kernel.
    objs := bpfObjects{}
    if err := loadBpfObjects(&objs, nil); err != nil {
        log.Fatalf("loading objects: %v", err)
    }
    defer objs.Close()

    // Attach the kprobe.
    kp, err := link.Kprobe("__x64_sys_execve", objs.HandleExecve, nil)
    if err != nil {
        log.Fatalf("attaching kprobe: %v", err)
    }
    defer kp.Close()

    log.Println("eBPF program attached. Waiting for events... Press Ctrl+C to exit.")

    // Open a ring buffer reader from user space.
    rd, err := ringbuf.NewReader(objs.Rb)
    if err != nil {
        log.Fatalf("opening ringbuf reader: %v", err)
    }
    defer rd.Close()

    // Goroutine to handle process termination
    go func() {
        <-stopper
        log.Println("Received signal, exiting...")
        rd.Close()
    }()

    var ev event
    for {
        record, err := rd.Read()
        if err != nil {
            // Ringbuf is closed, probably due to program termination.
            if err == ringbuf.ErrClosed {
                return
            }
            log.Printf("reading from reader: %s", err)
            continue
        }

        // Parse the raw data into our event struct.
        if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &ev); err != nil {
            log.Printf("parsing ringbuf event: %s", err)
            continue
        }

        log.Printf("PID: %d, PPID: %d, Command: %s, Filename: %s",
            ev.Pid,
            ev.Ppid,
            unix.ByteSliceToString(ev.Comm[:]),
            unix.ByteSliceToString(ev.Filename[:]),
        )
    }
}

To run this:

  • You need clang, llvm, and libbpf-dev installed.
  • Generate the kernel headers for CO-RE: bpftool btf dump file /sys/kernel/btf/vmlinux format c > headers/vmlinux.h.
  • Run go generate to compile the C code and embed it into a Go file.
  • Run sudo go run . (root privileges are required to load eBPF programs).
  • Now, if you execute any command in another terminal (e.g., ls /tmp), you will see the corresponding event logged by our Go program.


Production Pattern: Correlating eBPF Events with Kubernetes Metadata

An event with only a PID is of little use in a Kubernetes context. We need to know which pod, namespace, and deployment the process belongs to. Querying the Docker/containerd daemon for every PID is inefficient and prone to race conditions. The robust solution is to build an in-memory cache by watching the Kubernetes API server.

Our user-space controller will now have two primary functions:

  • Watch the Kubernetes API for pod creation/deletion events to maintain a ContainerID -> PodMetadata cache.
  • Process eBPF events and enrich them using this cache.

We need a way to link a PID from an eBPF event to a container ID. The most reliable link is the cgroup ID, so we'll modify our eBPF program to capture the cgroup ID for each event.

1. Updated eBPF Program with Cgroup ID

pod_monitor.bpf.c (updated handle_execve)

c
// Add cgroup_id to the event struct
struct event {
    __u64 cgroup_id;
    __u32 pid;
    __u32 ppid;
    char comm[TASK_COMM_LEN];
    char filename[MAX_FILENAME_LEN];
};

// ... (rest of the file is the same)

SEC("kprobe/__x64_sys_execve")
int BPF_KPROBE(handle_execve, const struct pt_regs *regs)
{
    // ... (previous variable declarations and ring buffer reservation)

    // Get the cgroup ID: the inode number of the cgroup directory this task belongs to
    e->cgroup_id = bpf_get_current_cgroup_id();

    // ... (rest of the function is the same)

    bpf_ringbuf_submit(e, 0);
    return 0;
}
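
The user-space event struct must mirror the new C layout exactly, since binary.Read decodes the raw bytes positionally. A minimal sketch of the updated Go struct (the field names are ours; only the field order and sizes must match the C definition):

go
type event struct {
    CgroupID uint64
    Pid      uint32
    Ppid     uint32
    Comm     [taskCommLen]byte
    Filename [maxFilenameLen]byte
}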

2. Enhanced Go Controller with K8s API Watcher

We will use the official client-go library to interact with the Kubernetes API.

k8s_enricher.go

go
package main

import (
    "context"
    "log"
    "os"
    "path/filepath"
    "regexp"
    "sync"

    "golang.org/x/sys/unix"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/homedir"
)

type PodMetadata struct {
    Name      string
    Namespace string
    NodeName  string
}

// CgroupManager is a thread-safe cache for mapping cgroup IDs to pod metadata.
type CgroupManager struct {
    mu    sync.RWMutex
    cache map[uint64]PodMetadata
}

func NewCgroupManager() *CgroupManager {
    return &CgroupManager{
        cache: make(map[uint64]PodMetadata),
    }
}

func (cm *CgroupManager) Get(cgroupID uint64) (PodMetadata, bool) {
    cm.mu.RLock()
    defer cm.mu.RUnlock()
    meta, found := cm.cache[cgroupID]
    return meta, found
}

// The core challenge: parsing the cgroup path to find the container ID,
// then mapping that to a cgroup inode ID.
// This is highly dependent on the container runtime's cgroup driver (systemd vs cgroupfs).
// For cgroupfs on Docker/containerd, paths look like:
// /kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>
var podCgroupRegex = regexp.MustCompile(`pod([a-f0-9\-]+)`)

func (cm *CgroupManager) StartInformer(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) {
    factory := informers.NewSharedInformerFactoryWithOptions(clientset, 0, informers.WithTweakListOptions(func(options *metav1.ListOptions) {
        options.FieldSelector = "spec.nodeName=" + nodeName
    }))
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            // Logic to add the pod and its containers to the cache.
            // You would read the cgroup ID from /sys/fs/cgroup/... for each container.
            // This filesystem interaction is runtime-specific and omitted for brevity;
            // the key is to map the container's cgroup inode number to the pod metadata
            // (see the Edge Case section below).
        },
        DeleteFunc: func(obj interface{}) {
            // Logic to remove the pod and its containers from the cache
        },
    })

    log.Println("Starting Kubernetes informer...")
    factory.Start(ctx.Done())
    factory.WaitForCacheSync(ctx.Done())
    log.Println("Kubernetes informer synced.")
}

// main function would be updated to initialize this
func main_with_k8s() { // conceptual
    nodeName := os.Getenv("NODE_NAME")
    if nodeName == "" {
        log.Fatal("NODE_NAME environment variable not set")
    }

    // K8s client setup
    config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
    if err != nil {
        log.Fatal("failed to build kubeconfig")
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal("failed to create clientset")
    }

    cgroupManager := NewCgroupManager()
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    go cgroupManager.StartInformer(ctx, clientset, nodeName)

    // ... eBPF loading and event loop from previous example ...

    // Inside the event loop:
    // var ev event (with CgroupID)
    // ... read event from ring buffer ...

    if meta, found := cgroupManager.Get(ev.CgroupID); found {
        log.Printf("Pod: %s/%s, PID: %d, Command: %s, Filename: %s",
            meta.Namespace, meta.Name, ev.Pid,
            unix.ByteSliceToString(ev.Comm[:]),
            unix.ByteSliceToString(ev.Filename[:]))
    } else {
        // Event from a process not in a tracked pod (e.g., host process)
        log.Printf("Host Process Event: PID: %d, ...", ev.Pid)
    }
}

Edge Case: The Cgroup ID Mapping Problem

This is the hardest part of the implementation. The bpf_get_current_cgroup_id() helper returns the inode number of the cgroup directory the process belongs to. The user-space controller must:

  • Watch for new pods on its node.
  • For each container in a new pod, find its cgroup path under /sys/fs/cgroup/.
  • Get the inode number of that directory using a stat syscall (see the sketch after this list).
  • Store a mapping: inode_number -> PodMetadata.

This is complex because the cgroup path structure depends on the CRI (containerd, CRI-O) and the cgroup driver (systemd vs. cgroupfs), so a production-ready implementation needs to handle these variations. It also needs to handle container restarts within a pod, which may place the new container in a new cgroup that still belongs to the same pod.
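
To make the stat step concrete, here is a minimal sketch of a CgroupManager method that records one container's cgroup directory. It assumes the caller has already resolved the container's cgroup path (which, as noted above, is runtime- and driver-specific); the method name is ours, not part of any library.

go
// registerContainerCgroup stats a container's cgroup directory and maps its
// inode number (the value bpf_get_current_cgroup_id() reports) to pod metadata.
func (cm *CgroupManager) registerContainerCgroup(cgroupPath string, meta PodMetadata) error {
    var st unix.Stat_t
    if err := unix.Stat(cgroupPath, &st); err != nil {
        return err // the directory may already be gone for short-lived containers
    }
    cm.mu.Lock()
    defer cm.mu.Unlock()
    cm.cache[st.Ino] = meta
    return nil
}

The DeleteFunc handler would perform the inverse, removing every inode key recorded for the deleted pod.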


Advanced Use Case: Detecting Malicious File Access with `tracepoints`

Monitoring process execution is good, but monitoring file access is better for detecting threats like reading sensitive files (/etc/shadow) or writing to unexpected locations. For this, tracepoints are superior to kprobes. They are stable API points in the kernel, meaning they won't break with minor kernel updates.

We'll hook the sys_enter_openat tracepoint.

pod_monitor.bpf.c (additional program)

c
// ... existing ring buffer map and constants ...

// Add a new event type to distinguish between exec and open
enum event_type {
    EVENT_EXEC,
    EVENT_OPEN,
};

// The shared event struct gains a type field (handle_execve should now also
// set e->type = EVENT_EXEC). The ppid field is kept so the exec path still compiles.
struct event {
    __u64 cgroup_id;
    __u32 pid;
    __u32 ppid;
    enum event_type type;
    char comm[TASK_COMM_LEN];
    char filename[MAX_FILENAME_LEN];
};

// Tracepoint for sys_enter_openat.
// The context `struct trace_event_raw_sys_enter *ctx` is provided by the kernel
// and its structure is defined in vmlinux.h.
SEC("tracepoint/syscalls/sys_enter_openat")
int handle_openat(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e;
    const char *filename_ptr;

    // Reserve space
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) {
        return 0;
    }

    // Populate common fields
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->type = EVENT_OPEN;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    // The filename is the second argument to openat (args[0] is the dirfd).
    // We access it via the tracepoint context, then copy the string from user space.
    filename_ptr = (const char *)ctx->args[1];
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);

    // Submit event
    bpf_ringbuf_submit(e, 0);

    return 0;
}

Your Go application would now need to:

  • Load and attach this new handle_openat program to its tracepoint (see the sketch after this list).
  • Update the Go event struct to include the type field.
  • In the event processing loop, switch on the event type to handle exec and open events differently.
  • Implement a policy engine. For example: IF event.type == EVENT_OPEN AND pod.name CONTAINS 'nginx' AND event.filename == '/etc/shadow' THEN ALERT.
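
A minimal sketch of the attach step and a naive inline version of the example rule. The objs.HandleOpenat handle is the name bpf2go would generate for the new program, and the Type/eventOpen fields assume the Go event struct has been extended to mirror the C enum; both names are ours. A real policy engine would be rule-driven rather than hard-coded.

go
// Attach the new program to its tracepoint (group "syscalls", name "sys_enter_openat").
tp, err := link.Tracepoint("syscalls", "sys_enter_openat", objs.HandleOpenat, nil)
if err != nil {
    log.Fatalf("attaching tracepoint: %v", err)
}
defer tp.Close()

// Later, inside the enriched event loop:
if ev.Type == eventOpen &&
    strings.Contains(meta.Name, "nginx") &&
    unix.ByteSliceToString(ev.Filename[:]) == "/etc/shadow" {
    log.Printf("ALERT: pod %s/%s (pid %d) opened /etc/shadow",
        meta.Namespace, meta.Name, ev.Pid)
}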

Performance Considerations & Production Hardening

Running eBPF programs on every syscall can have performance implications if not handled carefully.

  • In-Kernel Filtering: The single most important optimization is to avoid sending unnecessary data to user space. The kernel-userspace boundary crossing is expensive. If you only care about processes running inside containers, you can filter out all host processes directly in your eBPF program.

    * Strategy: Your user-space program can create an eBPF map (e.g., a hash map) and populate it with the cgroup IDs of all containers it's monitoring. The eBPF program then checks whether the current process's cgroup ID exists in this map before sending an event (a user-space sketch for populating the map follows this list).

    * eBPF Code Snippet:

c
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u64);
    __type(value, __u8);
} monitored_cgroups SEC(".maps");

// In your kprobe/tracepoint:
__u64 cgroup_id = bpf_get_current_cgroup_id();
__u8 *is_monitored = bpf_map_lookup_elem(&monitored_cgroups, &cgroup_id);
if (!is_monitored) {
    return 0; // Not a cgroup we care about, drop the event.
}
// ... proceed to send event
  • Ring Buffer vs. Perf Buffer: As mentioned, BPF_MAP_TYPE_RINGBUF is generally superior for high-volume event sources. Unlike the per-CPU perf buffer, it is a single buffer shared by all CPUs, which preserves global event ordering and uses memory more efficiently. Its reserve/submit API avoids an extra copy, drops are explicit (bpf_ringbuf_reserve fails when the buffer is full rather than silently losing data), and consumer notifications can be batched to avoid a wakeup per event.
  • CO-RE and BTF: Relying on BTF (BPF Type Format) is crucial for production. Without it, your eBPF program is compiled against the specific kernel headers of your build machine; deploy it to a node with a different kernel version and it may fail to load because struct offsets or field names have changed. BTF embeds type information into the eBPF object file, allowing the cilium/ebpf library to perform runtime relocations, making your program portable across a fleet of machines with varying kernel versions.
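
A minimal sketch of the user-space half of the in-kernel filter described above. It assumes the hash map is compiled into the same object file, so bpf2go exposes it as objs.MonitoredCgroups (adjust to whatever name your generated bindings produce); the informer's add/delete handlers would call these helpers as pods come and go.

go
// addMonitoredCgroup tells the in-kernel filter to start forwarding events
// for the given cgroup ID.
func addMonitoredCgroup(objs *bpfObjects, cgroupID uint64) error {
    return objs.MonitoredCgroups.Put(cgroupID, uint8(1))
}

// removeMonitoredCgroup stops forwarding events for a cgroup that no longer exists.
func removeMonitoredCgroup(objs *bpfObjects, cgroupID uint64) error {
    return objs.MonitoredCgroups.Delete(cgroupID)
}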

Deployment as a Kubernetes DaemonSet

This tool is designed to run on every node in the cluster. The natural Kubernetes construct for this is a DaemonSet.

daemonset.yaml

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-pod-monitor
  namespace: kube-system
  labels:
    app: ebpf-pod-monitor
spec:
  selector:
    matchLabels:
      app: ebpf-pod-monitor
  template:
    metadata:
      labels:
        app: ebpf-pod-monitor
    spec:
      hostPID: true # Lets the user-space agent resolve host PIDs via /proc; the eBPF probes see all processes regardless
      tolerations:
      - operator: Exists
      serviceAccountName: ebpf-monitor-sa
      containers:
      - name: monitor
        image: your-repo/ebpf-pod-monitor:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          # This is the most critical and dangerous part.
          # CAP_BPF and CAP_PERFMON are needed to load and manage eBPF programs.
          # CAP_SYS_ADMIN is a broad capability often required for certain helpers.
          # Running as privileged is the easiest way but least secure.
          privileged: true
        volumeMounts:
        - name: bpf-fs
          mountPath: /sys/fs/bpf
          readOnly: false
      volumes:
      - name: bpf-fs
        hostPath:
          path: /sys/fs/bpf
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebpf-monitor-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ebpf-monitor-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ebpf-monitor-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ebpf-monitor-role
subjects:
- kind: ServiceAccount
  name: ebpf-monitor-sa
  namespace: kube-system

Security Implications:

Loading eBPF programs requires significant privileges. The privileged: true flag gives the container near-complete control over the host kernel, which is a major security risk. On newer kernels (5.8+), you can scope this down by granting CAP_BPF and CAP_PERFMON instead of running fully privileged, although some helpers and older kernels still require CAP_SYS_ADMIN. The principle of least privilege is paramount: any vulnerability in your user-space Go controller could be catastrophic if it is running as a privileged container.

Conclusion

eBPF represents a paradigm shift in kernel observability and runtime security. By moving detection logic from user-space agents directly into the kernel, we can build tools that are far more performant and comprehensive than their traditional counterparts. We've demonstrated a practical, albeit complex, path to building a Kubernetes-aware security monitor: hooking syscalls, enriching events with pod metadata, and applying the performance and deployment patterns required for production. While the implementation details, particularly around cgroup and PID management, are non-trivial, the payoff is a level of visibility into your running workloads that was previously unattainable without significant performance compromises.
