Advanced eBPF for Kubernetes: syscall Hooking for Runtime Security
The Observability Gap in Container Runtime Security
As senior engineers responsible for production Kubernetes clusters, we understand that static analysis, vulnerability scanning, and network policies are necessary but insufficient. The most insidious threats manifest at runtime—an exploited application executing a shell, a compromised process opening a reverse shell, or unauthorized file access within a container. Traditional host-based intrusion detection systems (HIDS) often struggle with the ephemeral and abstracted nature of containers, leading to a significant observability gap at the kernel level.
This is where the Extended Berkeley Packet Filter (eBPF) becomes a game-changer. It allows us to run sandboxed programs directly in the Linux kernel without changing kernel source code or loading kernel modules. For security, this means we can instrument kernel behavior, such as system calls (syscalls), at the source of truth with minimal performance overhead.
This article bypasses the basics. We assume you understand what eBPF is and why it's powerful. Our focus is on the practical, advanced implementation of a runtime security monitor for Kubernetes. We will build a system that:
- Uses tracepoints to hook the execve syscall, capturing every new process execution across the entire node.
- Writes a portable, CO-RE (Compile Once - Run Everywhere) eBPF program in C.
- Develops a sophisticated user-space agent in Go that loads the eBPF program and consumes events.
- Solves the critical challenge of enriching raw kernel events (containing just a PID) with Kubernetes context (Namespace, Pod, Container).
- Implements advanced in-kernel filtering with BPF maps to dramatically reduce data volume and performance overhead.
- Packages the entire solution as a Kubernetes DaemonSet for cluster-wide deployment.
Environment and Tooling Prerequisites
This is not a step-by-step tutorial for setting up a development environment. We expect you have a working Linux environment (or VM) with a modern kernel (5.8+ recommended for stable features like ring buffers and full BTF support). The following tools are essential:
- clang and llvm (v10+): Required for compiling C eBPF code into BPF bytecode.
- libbpf (development headers): The canonical library for interacting with the BPF subsystem from C. Our eBPF program will link against it.
- bpftool: The swiss-army knife for inspecting and managing BPF objects on the system.
- cilium/ebpf Go library: A powerful library for loading and interacting with eBPF programs from Go.
- A local Kubernetes cluster (minikube or kind) to deploy our final agent.

Crucially, your kernel must be compiled with BTF (BPF Type Format) support. BTF embeds type information about the kernel into the kernel image itself, which is the cornerstone of CO-RE. You can check for its existence:
$ ls /sys/kernel/btf/vmlinux
/sys/kernel/btf/vmlinux
If this file exists, you're ready to build portable eBPF programs.
Section 1: The eBPF Program - Hooking `execve`
Our goal is to capture every attempt to execute a new program. The most reliable way to do this is by hooking the execve syscall. We'll use a tracepoint for this, as they provide a stable API compared to kprobes, which can be more fragile across kernel versions. The specific tracepoint is syscalls/sys_enter_execve.
Let's create bpf_program.c.
// bpf_program.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 256
// Event structure sent to user space
struct event {
u32 pid;
u32 ppid;
u64 cgroup_id;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// BPF ring buffer
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Optional: force BTF generation for our event struct
const struct event *unused __attribute__((unused));
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32;
// Get parent PID
struct task_struct *task = (struct task_struct *)bpf_get_current_task();
u32 ppid = BPF_CORE_READ(task, real_parent, tgid);
// Reserve space on the ring buffer
struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
// Populate the event structure
e->pid = pid;
e->ppid = ppid;
e->cgroup_id = bpf_get_current_cgroup_id();
bpf_get_current_comm(&e->comm, sizeof(e->comm));
// Read the filename argument from user space
const char __user *filename_ptr = (const char __user *)BPF_CORE_READ(ctx, args[0]);
bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
// Submit the event to the ring buffer
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Advanced Implementation Details:
- #include "vmlinux.h": This is the magic of CO-RE. Instead of including dozens of kernel headers, we generate a single header file containing all kernel type definitions from our system's BTF info. We generate it once with bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h.
- BPF_CORE_READ: This macro is a CO-RE helper that allows us to access kernel struct fields in a portable way. If a kernel version changes the struct layout, our BPF program doesn't need to be recompiled. Here, we use it to get the parent PID (real_parent->tgid) and the filename pointer (ctx->args[0]).
- BPF_MAP_TYPE_RINGBUF: We use a ring buffer map to send data to user space. Compared to the older BPF_MAP_TYPE_PERF_EVENT_ARRAY, ring buffers are more performant, memory-efficient, and preserve event ordering across CPUs; when the buffer is full, new events are dropped rather than overwriting older, unconsumed ones—a critical property for a security monitor.
- bpf_probe_read_user_str: Reading data from user-space memory into the kernel context is a delicate operation. This helper safely copies the null-terminated filename string into our event struct, guarding against faulty user-space pointers.
- bpf_get_current_cgroup_id(): This is the key to linking a kernel event to a container. The cgroup ID is a stable identifier for the control group a process belongs to. In Kubernetes, each container runs in its own cgroup, making this ID the perfect bridge between the kernel and container orchestrator context.

To compile this, we use clang:
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I. -c bpf_program.c -o bpf_program.o
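Before moving to the agent, note the packing convention used in the eBPF program: bpf_get_current_pid_tgid() returns a u64 with the TGID (the user-visible process ID) in the upper 32 bits and the kernel thread ID in the lower 32, which is why the program shifts by 32. A small sketch of that split on the Go side (the helper name is illustrative):

```go
package main

import "fmt"

// splitPidTgid mirrors the `id >> 32` shift used in the eBPF program:
// the TGID lives in the upper 32 bits, the thread ID in the lower 32.
func splitPidTgid(id uint64) (tgid, tid uint32) {
	return uint32(id >> 32), uint32(id & 0xFFFFFFFF)
}

func main() {
	// A hypothetical packed value: TGID 1234, thread ID 5678.
	id := uint64(1234)<<32 | 5678
	tgid, tid := splitPidTgid(id)
	fmt.Printf("tgid=%d tid=%d\n", tgid, tid) // prints "tgid=1234 tid=5678"
}
```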
Section 2: The User-Space Go Controller
Now we need a user-space program to load, attach, and listen for events from our eBPF program. We will use the cilium/ebpf library, which provides excellent Go bindings for the libbpf API.
Here is a simplified main.go.
// main.go
package main
import (
"bytes"
"encoding/binary"
"log"
"os"
"os/signal"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/ringbuf"
"github.com/cilium/ebpf/rlimit"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf bpf_program.c -- -I.
// Event structure must match the C struct
type Event struct {
PID uint32
PPID uint32
CgroupID uint64
Comm [16]byte
Filename [256]byte
}
func main() {
// Subscribe to signals for graceful shutdown
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
// Allow the current process to lock memory for eBPF maps
if err := rlimit.RemoveMemlock(); err != nil {
log.Fatal(err)
}
// Load pre-compiled BPF programs and maps into the kernel.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach the tracepoint
tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
if err != nil {
log.Fatalf("attaching tracepoint: %v", err)
}
defer tp.Close()
log.Println("Successfully loaded and attached eBPF program. Waiting for events...")
// Open a ring buffer reader from user space.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %s", err)
}
defer rd.Close()
// Close the reader when the process receives a signal, which will exit
// the loop.
go func() {
<-stopper
log.Println("Received signal, exiting...")
if err := rd.Close(); err != nil {
log.Fatalf("closing ringbuf reader: %s", err)
}
}()
var event Event
for {
record, err := rd.Read()
if err != nil {
if err == ringbuf.ErrClosed {
log.Println("Ring buffer closed")
return
}
log.Printf("error reading from ring buffer: %s", err)
continue
}
// Parse the raw data into our Go struct
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("error parsing event: %s", err)
continue
}
// TODO: Enrich this event with Kubernetes context
log.Printf("PID: %d, PPID: %d, CgroupID: %d, Comm: %s, Filename: %s",
event.PID, event.PPID, event.CgroupID, bytes.TrimRight(event.Comm[:], "\x00"), bytes.TrimRight(event.Filename[:], "\x00"))
}
}
Code Breakdown:
- //go:generate: This magic comment automates the process of converting our compiled BPF object file (bpf_program.o) into embeddable Go code. Running go generate will create bpf_bpfel_x86.go and bpf_bpfeb_x86.go, which contain the BPF bytecode and helper functions to load it.
- rlimit.RemoveMemlock(): eBPF requires locked memory to operate. This helper function removes the memory lock limit for our process, a necessary step before loading BPF objects.
- loadBpfObjects: This function is generated by bpf2go and handles loading the BPF bytecode and maps into the kernel.
- link.Tracepoint: This function from the cilium/ebpf library attaches our compiled eBPF program (objs.TracepointSyscallsSysEnterExecve) to the specified tracepoint.
- ringbuf.NewReader: We open a reader for the ring buffer map we defined in our C code. This is how we receive events.
- The for loop continuously reads records from the ring buffer. We then parse the raw byte slice into our Go Event struct. The struct layout and field sizes must exactly match the C struct for this to work correctly.

At this point, we have a functional, node-level process monitor. If you run this on a Linux host, you will see every command execution. But in Kubernetes, this output is nearly useless without context.
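That exact-match requirement between the C and Go structs can be exercised offline, without a kernel in the loop, by decoding a synthetic record. A minimal sketch (the parseEvent helper is illustrative; the struct is the one defined above):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Event mirrors the C struct byte-for-byte: two u32s, a u64, then two
// fixed-size char arrays (4 + 4 + 8 + 16 + 256 = 288 bytes, no padding).
type Event struct {
	PID      uint32
	PPID     uint32
	CgroupID uint64
	Comm     [16]byte
	Filename [256]byte
}

// parseEvent decodes a raw ring-buffer sample exactly as the main loop does.
func parseEvent(raw []byte) (Event, error) {
	var e Event
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &e)
	return e, err
}

func main() {
	// Build a synthetic 288-byte record as the kernel would lay it out.
	buf := new(bytes.Buffer)
	binary.Write(buf, binary.LittleEndian, uint32(42))   // pid
	binary.Write(buf, binary.LittleEndian, uint32(1))    // ppid
	binary.Write(buf, binary.LittleEndian, uint64(7777)) // cgroup_id
	var comm [16]byte
	copy(comm[:], "bash")
	binary.Write(buf, binary.LittleEndian, comm)
	var filename [256]byte
	copy(filename[:], "/bin/bash")
	binary.Write(buf, binary.LittleEndian, filename)

	e, err := parseEvent(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(e.PID, string(bytes.TrimRight(e.Comm[:], "\x00"))) // prints "42 bash"
}
```

If the Go struct drifts from the C side (a resized array, a reordered field), this kind of round-trip check fails immediately instead of producing silently corrupted events in production.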
Section 3: The Kubernetes Context Enrichment Challenge
An event with PID: 12345, CgroupID: 5678, Filename: /bin/bash is meaningless without knowing it originated from Namespace: prod, Pod: api-server-xyz, Container: main.
This is the most complex part of building a production-grade monitor. We need to build a real-time cache that maps cgroup_id to Kubernetes metadata.
The strategy involves two parts:
1. The CRI socket: The local container runtime socket (/var/run/containerd/containerd.sock or /var/run/crio/crio.sock) holds the ground truth about which containers are running on that node and their cgroup paths.
2. The Kubelet's local Pods API: This supplies the Kubernetes metadata (Namespace, Pod name, container names) for every pod scheduled on the node, which we correlate with the CRI data via the Pod UID.

Here's an advanced pattern for building this enrichment service in Go.
// Part of main.go - enrichment logic
// A simplified cache structure
type PodInfo struct {
Namespace string
PodName string
ContainerName string
}
// cgroup ID -> PodInfo
var enrichmentCache = make(map[uint64]PodInfo)
var cacheLock sync.RWMutex
// This function would run in a separate goroutine to periodically sync.
func syncEnrichmentCache() {
// 1. Connect to the local CRI socket (e.g., containerd)
// This requires CRI client libraries.
// Pseudocode:
// conn, err := grpc.Dial("unix:///run/containerd/containerd.sock", ...)
// criClient := runtimeapi.NewRuntimeServiceClient(conn)
// containers, err := criClient.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
// 2. For each container from CRI:
// - Get its cgroup path from the container status.
// - Extract the cgroup ID from the path.
// The path looks like: /kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>
// The cgroup ID is the inode number of this directory on the cgroup
// filesystem, so we stat the directory to obtain it (there is no
// separate cgroup.id file).
// - Get the Pod UID and Container Name from the container's labels.
// 3. Connect to the local Kubelet's Pods API (https://<node-ip>:10250/pods)
// This gives a list of all pods running on this node.
// Pseudocode:
// resp, err := kubeletClient.Get("https://localhost:10250/pods")
// 4. Correlate: Match the Pod UID from CRI with the Pod metadata from Kubelet.
// Build the PodInfo struct.
// 5. Update the cache
// cacheLock.Lock()
// enrichmentCache[cgroupID] = podInfo
// cacheLock.Unlock()
}
// Inside the main event loop:
func enrichAndLog(event Event) {
cacheLock.RLock()
info, found := enrichmentCache[event.CgroupID]
cacheLock.RUnlock()
if !found {
log.Printf("[Unenriched] PID: %d, CgroupID: %d, Comm: %s, Filename: %s",
event.PID, event.CgroupID, bytes.TrimRight(event.Comm[:], "\x00"), bytes.TrimRight(event.Filename[:], "\x00"))
return
}
log.Printf("[Enriched] Namespace: %s, Pod: %s, Container: %s, PID: %d, Comm: %s, Filename: %s",
info.Namespace, info.PodName, info.ContainerName, event.PID, bytes.TrimRight(event.Comm[:], "\x00"), bytes.TrimRight(event.Filename[:], "\x00"))
}
// In main(), after starting the eBPF listener:
go func() {
ticker := time.NewTicker(30 * time.Second) // Sync every 30 seconds
defer ticker.Stop()
for {
syncEnrichmentCache()
<-ticker.C
}
}()
// The main loop now calls enrichAndLog(event)
Edge Cases and Production Considerations:
* Race Conditions: A container might start and stop between sync intervals. The initial event will be unenriched. A more robust solution involves watching the CRI/Kubelet streams instead of periodic polling.
* Performance: The sync process can be resource-intensive. It must be optimized to only update deltas.
* Cgroup v1 vs v2: The cgroup path format and how you get the ID can differ. The code must handle both.
* Authentication: Accessing the Kubelet API requires proper authentication, typically via a ServiceAccount token mounted into the agent's Pod.
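On the cgroup v2 point above: the value bpf_get_current_cgroup_id() reports is the inode number of the cgroup's directory on the cgroup filesystem, so the user-space half of the mapping can be a plain stat of the path obtained from the container status. A minimal sketch, assuming a Linux build (the path in main is purely illustrative; the agent would pass the real cgroup path from CRI):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// cgroupID returns the cgroup ID for a cgroup directory by reading the
// directory's inode number, which on cgroup v2 matches the value that
// bpf_get_current_cgroup_id() reports in the kernel.
func cgroupID(cgroupPath string) (uint64, error) {
	info, err := os.Stat(cgroupPath)
	if err != nil {
		return 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("unexpected stat type for %s", cgroupPath)
	}
	return st.Ino, nil
}

func main() {
	// Illustrative: any directory demonstrates the mechanism; in the agent
	// this would be e.g. /sys/fs/cgroup/kubepods.slice/.../<container>.
	id, err := cgroupID("/")
	if err != nil {
		panic(err)
	}
	fmt.Println("inode-based cgroup ID:", id)
}
```

The syncEnrichmentCache goroutine would call this for each container's cgroup path and use the result as the key into enrichmentCache.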
Section 4: Performance Optimization with In-Kernel Filtering
Our current implementation sends every single execve event to user space. In a busy cluster, this can be millions of events per minute, creating significant CPU overhead in our Go agent and flooding logs. The eBPF philosophy is to push as much logic as possible into the kernel.
Let's implement a policy: only notify user space if a process not in a predefined allowlist is executed. We'll use a BPF_MAP_TYPE_HASH for this.
First, update bpf_program.c:
// Add to bpf_program.c
// Map to store allowed binary paths (e.g., "/usr/bin/ls")
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1024);
__type(key, char[MAX_FILENAME_LEN]);
__type(value, u8);
} allowlist SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
char filename[MAX_FILENAME_LEN] = {0}; // zero-fill: a hash-map key must be fully initialized and deterministic
const char __user *filename_ptr = (const char __user *)BPF_CORE_READ(ctx, args[0]);
long ret = bpf_probe_read_user_str(&filename, sizeof(filename), filename_ptr);
if (ret < 0) {
return 0; // Or handle error
}
// Check if the filename exists in our allowlist map
if (bpf_map_lookup_elem(&allowlist, &filename)) {
// It's allowed, so we do nothing. Exit immediately.
return 0;
}
// If we are here, it's NOT in the allowlist. Proceed to send the event.
u64 id = bpf_get_current_pid_tgid();
// ... rest of the logic from before ...
struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
// ... populate and submit event ...
// Note: we need to copy the filename we already read
bpf_probe_read_kernel_str(&e->filename, sizeof(e->filename), filename);
bpf_ringbuf_submit(e, 0);
return 0;
}
Now, our Go agent is responsible for populating this map.
// Add to main.go
func populateAllowlist(allowlistMap *ebpf.Map) {
allowedBinaries := []string{
"/usr/bin/ls",
"/usr/bin/cat",
"/bin/sh",
// ... add binaries from a trusted base image
}
var value uint8 = 1
for _, bin := range allowedBinaries {
key := make([]byte, 256) // MAX_FILENAME_LEN
copy(key, bin)
if err := allowlistMap.Put(key, value); err != nil {
log.Printf("Failed to update allowlist map for %s: %v", bin, err)
}
}
log.Println("Allowlist map populated.")
}
// In main(), after loading objects:
populateAllowlist(objs.Allowlist)
This pattern is incredibly powerful. The performance difference is night and day. The kernel now filters the vast majority of benign events, and our user-space agent only wakes up to process genuinely anomalous behavior. A real-world policy could be loaded from a Kubernetes ConfigMap and dynamically updated.
Benchmark Consideration
A naive execve hook might add 3-5µs of overhead to every execution. On a system with 10,000 execve calls per second, that's 3-5% of a single CPU core just for instrumentation. With in-kernel filtering, if 99.9% of calls are on the allowlist, the overhead drops to near zero for those calls, as the bpf_map_lookup_elem is extremely fast (hash map lookup). The total CPU usage of the eBPF program can decrease by orders of magnitude.
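The arithmetic behind that estimate is worth making explicit, since it is how you would size the buffer and filtering policy for your own workload. A one-liner sketch:

```go
package main

import "fmt"

// overheadPercent estimates instrumentation cost: perCallMicros of hook
// overhead at callsPerSecond, expressed as a percentage of one CPU core.
func overheadPercent(callsPerSecond, perCallMicros float64) float64 {
	busySecondsPerSecond := callsPerSecond * perCallMicros / 1e6
	return busySecondsPerSecond * 100
}

func main() {
	// 10,000 execve/s at 4µs of hook overhead each ≈ 4% of one core.
	fmt.Printf("%.1f%%\n", overheadPercent(10000, 4))
}
```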
Section 5: Packaging as a Kubernetes DaemonSet
To deploy our agent across the cluster, a DaemonSet is the perfect tool. It ensures one instance of our agent pod runs on every node.
Here's a daemonset.yaml manifest:
# daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: ebpf-security-agent
namespace: kube-system
labels:
app: ebpf-security-agent
spec:
selector:
matchLabels:
app: ebpf-security-agent
template:
metadata:
labels:
app: ebpf-security-agent
spec:
# Run on the host's network and PID namespace to see all processes
hostNetwork: true
hostPID: true
# A service account is needed for Kubelet/CRI access
serviceAccountName: ebpf-agent-sa
tolerations:
- operator: Exists
containers:
- name: agent
image: your-repo/ebpf-agent:latest # Your container image
command: ["/ebpf-agent"]
securityContext:
# Essential: We need privileges to load eBPF programs
privileged: true
volumeMounts:
# Mount the BPF filesystem
- name: bpf-fs
mountPath: /sys/fs/bpf
# Mount kernel debug filesystem for tracepoints
- name: kernel-debug
mountPath: /sys/kernel/debug
# Mount the CRI socket for container enrichment
- name: containerd-sock
mountPath: /run/containerd/containerd.sock
volumes:
- name: bpf-fs
hostPath:
path: /sys/fs/bpf
- name: kernel-debug
hostPath:
path: /sys/kernel/debug
- name: containerd-sock
hostPath:
path: /run/containerd/containerd.sock
Critical `DaemonSet` Details:
* hostPID: true: This allows our agent to see all process IDs on the host, not just those inside its own PID namespace.
* privileged: true: This is the sledgehammer approach. It's required to get the CAP_BPF and CAP_SYS_ADMIN capabilities needed to load eBPF programs and interact with the kernel. In a production environment, you should aim to drop all capabilities except the ones you strictly need.
* volumeMounts: We must mount /sys/fs/bpf and /sys/kernel/debug from the host into the container. libbpf uses these filesystems to pin maps and attach programs. We also mount the CRI socket for our enrichment logic.
Conclusion: Beyond a Simple Monitor
We have successfully architected and implemented an advanced eBPF-based runtime security monitor for Kubernetes. We've moved far beyond a simple "hello world" by tackling the essential production challenges: achieving kernel-to-Kubernetes context awareness, optimizing performance with in-kernel aggregation, and packaging the solution for cluster-wide deployment.
The real power of this approach is its extensibility. This framework can be expanded to monitor other high-risk syscalls (connect, openat, bpf), track network connections, or even enforce security policies by blocking syscalls using seccomp or LSM hooks triggered by eBPF. The combination of kernel-level primitives from eBPF and the contextual awareness of the Kubernetes control plane provides a foundation for building truly next-generation cloud-native security tooling.