Leveraging eBPF for Kernel-Level Runtime Security in Kubernetes
The Observability Gap in Ephemeral Infrastructure
In modern Kubernetes environments, traditional security paradigms fall short. Static container scanning, network policies, and admission controllers are necessary but insufficient layers of defense. They operate before runtime and lack visibility into the actual behavior of a process once it's running. What happens when a zero-day vulnerability in a web server allows an attacker to spawn a shell? Your static analysis is useless, and your network policies might not detect the command-and-control traffic if it tunnels over HTTP/S. This is the runtime security problem, and its solution lies directly within the Linux kernel.
Enter eBPF (extended Berkeley Packet Filter). eBPF is a revolutionary technology that allows us to run sandboxed programs within the kernel without changing kernel source code or loading kernel modules. For security engineers, this is the holy grail: the ability to observe and even control system behavior at the most fundamental level, with minimal performance overhead.
This article is not an introduction to eBPF. It assumes you understand what eBPF is and why it's powerful. Instead, we will focus on a critical, production-level implementation pattern: building a targeted, K8s-aware syscall monitor that uses efficient kernel-side filtering to minimize data overhead. We will build a tool that can detect when a process inside a specific, labeled pod executes a new program (the execve syscall), and we'll do it in a way that is scalable and performant enough for a production cluster.
Section 1: Architecture and Toolchain for Production eBPF
Before writing code, we must architect our solution and choose the right tools. A naive eBPF tool might attach to a syscall and dump every event to user space for filtering. This is untenable in production; a single execve tracepoint on a busy node can generate thousands of events per second from legitimate system activity, creating massive CPU and I/O load.
Our architecture will be smarter:
1. A user-space agent watches the Kubernetes API for pods carrying a specific label (e.g., security.monitor=true).
2. For each matching pod, the agent resolves the pod's cgroup ID and writes it into a BPF_MAP_TYPE_HASH.
3. Our eBPF program, attached to the execve syscall tracepoint, will fire on every execution. Crucially, its first action will be to check if the process's cgroup ID exists in our hash map. If not, it exits immediately. This is our kernel-side filter.
4. Events that pass the filter are streamed to user space through a BPF_MAP_TYPE_RINGBUF, a modern, high-performance, and lossless mechanism for kernel-to-user-space communication.
Toolchain: `libbpf-bootstrap` over BCC
For production eBPF, the choice between the BPF Compiler Collection (BCC) and a libbpf-based approach is critical. BCC is excellent for ad-hoc debugging and scripting, as it embeds a C-to-BPF compiler. However, this means you must ship the entire LLVM/Clang toolchain to your production nodes, a significant dependency and potential attack surface.
We will use the libbpf-bootstrap model. This approach leverages CO-RE (Compile Once – Run Everywhere). We compile our eBPF C code into a compact object file on a build machine. This object file contains BPF bytecode and relocation information derived from BTF (BPF Type Format), which describes kernel types. The libbpf library on the target node uses this information to adapt the program to the running kernel's specific memory layouts and offsets. This results in a small, self-contained binary with no runtime compilation dependencies.
Our project structure will look like this:
/ebpf-security-monitor
|-- go.mod
|-- main.go # User-space Go controller/agent
|-- bpf/
| |-- monitor.bpf.c # The eBPF C code
| |-- monitor.bpf.h # Shared structs between C and Go
| |-- vmlinux.h # Kernel type definitions for CO-RE
|-- Makefile # To compile the eBPF C code
Section 2: Implementing the eBPF Kernel Program
Let's write the heart of our monitor: the eBPF C code. This code will be compiled into an ELF object file and loaded by our Go application.
First, we need a header file (bpf/monitor.bpf.h) to define the data structure for events we send to user space. Sharing this definition ensures type safety between the kernel and user space.
bpf/monitor.bpf.h
#ifndef __MONITOR_BPF_H
#define __MONITOR_BPF_H
#define TASK_COMM_LEN 16
#define FILENAME_LEN 256
struct event {
__u32 pid;
__u64 cgroup_id;
char comm[TASK_COMM_LEN];
char filename[FILENAME_LEN];
int retval;
};
#endif /* __MONITOR_BPF_H */
Now for the main eBPF program. This file defines our maps and the tracepoint logic.
bpf/monitor.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "monitor.bpf.h"
// Ring buffer for sending events to user space
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Hash map for filtering by cgroup ID. User-space populates this.
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, __u64);
__type(value, __u8);
} monitored_cgroups SEC(".maps");
// This tracepoint is more reliable than a kprobe on do_execve
// as its API is stable across kernel versions.
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
// 1. Get cgroup ID for the current process
__u64 cgroup_id = bpf_get_current_cgroup_id();
// 2. Perform the kernel-side filter check
// bpf_map_lookup_elem is a highly efficient hash table lookup.
void *is_monitored = bpf_map_lookup_elem(&monitored_cgroups, &cgroup_id);
if (!is_monitored) {
return 0; // Not a monitored cgroup, exit immediately.
}
// 3. Reserve space on the ring buffer for our event
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0; // Failed to reserve space, maybe buffer is full.
}
// 4. Populate the event with data
__u64 id = bpf_get_current_pid_tgid();
e->pid = id >> 32; // upper 32 bits hold the tgid, i.e., the user-visible PID
e->cgroup_id = cgroup_id;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
// 5. Read the filename argument from user-space memory.
// This is a tricky operation: the read can fail (e.g., on a page
// fault or a bad pointer), so the return value must be checked.
const char *filename_ptr = (const char *)ctx->args[0];
long ret = bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
if (ret < 0) {
// Error reading the string (likely an invalid pointer). Discard the
// reservation so corrupted data never reaches user space.
bpf_ringbuf_discard(e, 0);
return 0;
}
// 6. Submit the event to user space
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Key Decisions in the eBPF Code:
* tracepoint/syscalls/sys_enter_execve: We use a tracepoint instead of a kprobe. Tracepoints are stable API points in the kernel, making our program more robust across kernel updates. Kprobes attach to function names, which can change.
* bpf_map_lookup_elem: This is the core of our filtering logic. The check happens entirely in the kernel. If the cgroup ID of the process executing execve is not in our monitored_cgroups map, the program exits, having consumed negligible CPU cycles.
* bpf_ringbuf_reserve/bpf_ringbuf_submit: We use a ring buffer. Unlike the older perf buffer, it guarantees event order, prevents data overwriting (events are dropped if the buffer is full, but existing ones are safe), and is generally more efficient for high-throughput event streaming.
* bpf_probe_read_user_str error handling: Reading from user-space memory is a privileged operation that can fail (e.g., on a page fault). We *must* check the return value. Here, we discard the event if the read fails, preventing corrupted data from reaching our agent. A more advanced implementation might send an error event.
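The drop-when-full behavior of bpf_ringbuf_reserve is worth internalizing, since it is the opposite of losing old data to overwrites. As a mental model only (this is a user-space Go analogy, not the kernel mechanism), a bounded channel with a non-blocking send behaves the same way:

```go
package main

import "fmt"

// tryPublish mimics the bpf_ringbuf_reserve discipline: if the buffer
// is full, the NEW event is dropped and already-buffered events are
// left untouched, so consumers never see corrupted or overwritten data.
func tryPublish(buf chan string, ev string) bool {
	select {
	case buf <- ev:
		return true // reservation succeeded
	default:
		return false // buffer full: drop the new event, like a nil reserve
	}
}

func main() {
	buf := make(chan string, 2)
	fmt.Println(tryPublish(buf, "a"), tryPublish(buf, "b"), tryPublish(buf, "c"))
	// prints: true true false (the third event is dropped, not overwritten)
}
```

The practical consequence is that a consumer that falls behind loses the newest events, so the user-space reader loop must be fast and the buffer sized for bursts.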
To compile this, we need a Makefile and to generate vmlinux.h.
Makefile
.PHONY: all clean

# BPF object file
BPF_OBJ = bpf/monitor.bpf.o
# User-space binary
GO_BIN = ebpf-monitor

# Tools
CLANG = clang
GO = go
BPFTOOL = bpftool

all: $(GO_BIN)

# Compile Go binary. The go:generate directive in main.go runs bpf2go,
# which compiles the C code and embeds the bytecode in generated Go files.
$(GO_BIN): main.go $(BPF_OBJ)
	$(GO) generate ./...
	$(GO) build -o $(GO_BIN) main.go

# Compile the BPF C code to a standalone object (useful for inspection
# with bpftool and for verifying the program outside the Go build).
$(BPF_OBJ): bpf/monitor.bpf.c bpf/vmlinux.h
	$(CLANG) -g -O2 -target bpf -c bpf/monitor.bpf.c -o $(BPF_OBJ)

# Generate vmlinux.h for CO-RE from the build machine's kernel BTF.
bpf/vmlinux.h:
	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > bpf/vmlinux.h

clean:
	rm -f $(GO_BIN) $(BPF_OBJ)
Note: On kernels that ship without embedded BTF (no /sys/kernel/btf/vmlinux), the btfhub-archive project provides pre-generated raw BTF files for many distribution kernels; bpftool can consume one of those files in place of the kernel's own BTF to produce vmlinux.h.
Section 3: The User-space Go Controller and Agent
Now we'll build the Go application that orchestrates the entire process. It will use the cilium/ebpf library for interacting with the eBPF subsystem and the official Kubernetes Go client.
main.go
package main
import (
"bytes"
"context"
"encoding/binary"
"errors"
"fmt"
"log"
"os"
"os/signal"
"regexp"
"strings"
"syscall"
"github.com/cilium/ebpf"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/ringbuf"
v1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall" -type event bpf ./bpf/monitor.bpf.c -- -I./bpf
const (
MONITOR_LABEL = "security.monitor"
)
// cgroupV2IDRegex illustrates matching a cgroup v2 path from /proc/[pid]/cgroup
// for a containerd-managed pod; the capture group is the container ID.
var cgroupV2IDRegex = regexp.MustCompile(`^0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod[a-f0-9_]+\.slice/cri-containerd-([a-f0-9]+)\.scope$`)
func main() {
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// Load the compiled eBPF objects
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading bpf objects: %v", err)
}
defer objs.Close()
// Attach the tracepoint
tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
if err != nil {
log.Fatalf("attaching tracepoint: %v", err)
}
defer tp.Close()
log.Println("eBPF programs loaded and attached.")
// Set up Kubernetes client
k8sClient, err := newKubernetesClient()
if err != nil {
log.Fatalf("creating k8s client: %v", err)
}
// Start the pod monitor goroutine
go monitorPods(ctx, k8sClient, objs.MonitoredCgroups)
// Set up the ring buffer reader
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("creating ringbuf reader: %v", err)
}
defer rd.Close()
// rd.Read() blocks, so close the reader when the context is cancelled;
// this unblocks the loop below with ringbuf.ErrClosed.
go func() {
<-ctx.Done()
rd.Close()
}()
// Start the event reading loop
log.Println("Waiting for events...")
var event bpfEvent
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, ringbuf.ErrClosed) {
log.Println("Received signal, exiting...")
return
}
log.Printf("reading from reader: %s", err)
continue
}
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing ringbuf event: %s", err)
continue
}
comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
filename := string(event.Filename[:bytes.IndexByte(event.Filename[:], 0)])
log.Printf("ALERT: Process '%s' (PID: %d) executed '%s' in a monitored pod (cgroup: %d)", comm, event.Pid, filename, event.CgroupId)
}
}
func newKubernetesClient() (*kubernetes.Clientset, error) {
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("getting in-cluster config: %w", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("creating clientset: %w", err)
}
return clientset, nil
}
// monitorPods watches for pods with the security.monitor label and updates the eBPF map.
func monitorPods(ctx context.Context, clientset *kubernetes.Clientset, cgroupsMap *ebpf.Map) {
watcher, err := clientset.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{LabelSelector: MONITOR_LABEL + "=true"})
if err != nil {
log.Fatalf("failed to watch pods: %v", err)
}
log.Println("Watching for pods with label 'security.monitor=true'")
for event := range watcher.ResultChan() {
pod, ok := event.Object.(*v1.Pod)
if !ok {
continue
}
cgroupID, err := getCgroupIDForPod(pod)
if err != nil {
log.Printf("could not get cgroup id for pod %s/%s: %v", pod.Namespace, pod.Name, err)
continue
}
key := cgroupID
value := uint8(1)
switch event.Type {
case "ADDED", "MODIFIED":
if err := cgroupsMap.Put(&key, &value); err != nil {
log.Printf("failed to update cgroups map for pod %s: %v", pod.Name, err)
}
log.Printf("Started monitoring pod: %s/%s (cgroup ID: %d)", pod.Namespace, pod.Name, cgroupID)
case "DELETED":
if err := cgroupsMap.Delete(&key); err != nil {
log.Printf("failed to delete from cgroups map for pod %s: %v", pod.Name, err)
}
log.Printf("Stopped monitoring pod: %s/%s (cgroup ID: %d)", pod.Namespace, pod.Name, cgroupID)
}
}
}
// getCgroupIDForPod is a complex and fragile part of the system.
// This is a simplified example for cgroup v2 and a specific CRI layout.
// Production systems need to handle cgroup v1 and different CRI runtimes.
func getCgroupIDForPod(pod *v1.Pod) (uint64, error) {
if len(pod.Status.ContainerStatuses) == 0 {
return 0, fmt.Errorf("pod has no container statuses")
}
containerID := pod.Status.ContainerStatuses[0].ContainerID
if containerID == "" {
return 0, fmt.Errorf("container ID is empty")
}
// Example format: containerd://<id>
parts := strings.Split(containerID, "//")
if len(parts) != 2 {
return 0, fmt.Errorf("malformed containerID: %s", containerID)
}
shortID := parts[1][:12] // Use a shortened ID for path matching
// This is a massive simplification. In reality, you'd walk the cgroupfs.
// Here we just construct an expected path and get its handle.
// This path is highly specific to this environment setup.
podUID := strings.ReplaceAll(string(pod.UID), "-", "_")
path := fmt.Sprintf("/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod%s.slice/cri-containerd-%s.scope", podUID, shortID)
// Use open and stat to get the handle.
f, err := os.Open(path)
if err != nil {
// Fallback for different QoS classes
path = fmt.Sprintf("/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod%s.slice/cri-containerd-%s.scope", podUID, shortID)
f, err = os.Open(path)
if err != nil {
return 0, fmt.Errorf("cannot find cgroup path for container %s", shortID)
}
}
defer f.Close()
var stat syscall.Stat_t
if err := syscall.Fstat(int(f.Fd()), &stat); err != nil {
return 0, fmt.Errorf("fstat failed: %w", err)
}
// On cgroup2, the inode number of the cgroup directory is often used as the ID.
// A more robust way involves using file handles, but this is a common method.
return stat.Ino, nil
}
Analysis of the Go Controller:
* bpf2go: This command from the cilium/ebpf library is crucial. It processes our C file, generates Go files (e.g., bpf_bpfel.go for little-endian targets) containing the compiled BPF bytecode, and creates Go structs that map directly to our BPF maps and programs. The -type event flag tells it to emit a bpfEvent struct mirroring our C struct. This is the glue that enables CO-RE.
* monitorPods goroutine: This is our K8s controller. It uses a Watch on pods with the security.monitor=true label. When a pod is added, it calls getCgroupIDForPod and updates the monitored_cgroups eBPF map. When a pod is deleted, it removes the entry. This ensures our kernel filter is always synchronized with the desired state in Kubernetes.
* The getCgroupIDForPod problem: This function highlights a major real-world challenge. There is no simple, direct way to map a Kubernetes Pod to a cgroup ID. The implementation here is a fragile heuristic based on parsing cgroupfs paths, which differ between cgroup v1/v2, CRI runtimes (containerd, CRI-O), and Kubernetes QoS classes. A production-grade system like Falco has extensive, complex logic to handle this robustly. We use a simplified fstat approach to get the inode number, which often serves as the cgroup ID in modern kernels.
* Event loop: The main function's primary loop blocks on rd.Read(). When an event arrives from the kernel (because it passed our cgroup filter), we parse it into our shared struct and log an alert. This is where you would integrate with systems like Prometheus, Fluentd, or a SIEM.
Section 4: Edge Cases and Production Pitfalls
Deploying eBPF at scale requires navigating a landscape of subtle technical challenges.
* Verifier constraints: eBPF programs cannot dereference arbitrary pointers; access to kernel and user memory must go through helpers like bpf_probe_read_str, which have built-in safety.
* Kernel version drift: CO-RE absorbs struct layout changes, but fields and features still appear and disappear across kernel releases. Probe for optional fields at load time (if (bpf_core_field_exists(...))) and ship multiple eBPF programs, selecting the right one at runtime based on kernel capabilities.
* TOCTOU: Our sys_enter_execve probe has a subtle vulnerability. We read the filename argument from the process's memory. An attacker could theoretically change the contents of that memory after our eBPF probe reads it but before the kernel actually executes the syscall. The definitive solution is to use Linux Security Module (LSM) eBPF hooks (e.g., bpf_lsm_bprm_check_security), which are called later in the execution path and operate on the kernel's trusted copy of the arguments. LSM hooks can also be used to block the syscall, moving from detection to prevention.
* Performance overhead: Attaching complex logic to hot syscalls like read or write can impact performance. It is critical to benchmark. Use tools like bpftool prog profile to measure the CPU cycles consumed by your eBPF program under load. Ensure your maps are correctly sized; a full map will cause insert failures, and a hash map with many collisions will degrade lookup performance.
Conclusion: The Future of Cloud-Native Security
We have successfully designed and implemented a sophisticated, Kubernetes-aware runtime security monitor. We didn't just dump events; we built an intelligent filtering system where the Kubernetes control plane programs the Linux kernel's security policy in real-time. This pattern of using eBPF maps as a dynamic, kernel-level filter, controlled by a user-space agent that understands application-level context (like Kubernetes labels), is a foundational technique for building scalable and efficient cloud-native security and observability tools.
By moving logic from user space into the kernel, we drastically reduced data volume and system load. We've seen how to use CO-RE for portable, production-ready deployments and acknowledged the complex realities of interacting with kernel and container abstractions. This is the direction the industry is heading. Projects like Cilium (for networking and security), Tetragon (for security observability), and Falco (for runtime security) are all built on these advanced eBPF principles. Mastering them is no longer optional for senior engineers working at the intersection of infrastructure, security, and performance.