eBPF-based Runtime Security Monitoring for Kubernetes Pods
The Blind Spot: Runtime Security in Ephemeral Kubernetes Environments
As senior engineers, we've architected robust CI/CD pipelines, mastered infrastructure-as-code, and deployed complex microservices on Kubernetes. Yet, a persistent blind spot remains: runtime security. The dynamic, ephemeral, and multi-tenant nature of Kubernetes shatters traditional security paradigms. Static container scanning is essential but insufficient; it tells you nothing about what your code actually does once it's running. Network policies are crucial for segmentation but are blind to host-level process and file system activity.
Attempts to fill this gap often introduce significant trade-offs:
* Sidecar-based Agents: Deploying a security agent in a sidecar container for every application pod introduces non-trivial resource overhead (CPU, memory) and can increase network latency. It also lives in userspace, making it susceptible to the same container breakout vulnerabilities it's meant to detect.
* `LD_PRELOAD`-based Interception: This technique hooks into library calls but is notoriously brittle. It can be easily bypassed by statically linked binaries or applications that make direct syscalls. It's an application-level solution for a system-level problem.
* Host-based Intrusion Detection Systems (HIDS): Traditional HIDS agents running on the node often lack Kubernetes context. An alert for "process X accessed file Y" is useless without knowing it originated from `pod-abc` in `namespace-prod`, belonging to the `billing-service` deployment.
What we need is a mechanism that provides deep, kernel-level visibility, is context-aware, safe, and has minimal performance overhead. This is precisely the problem that eBPF (extended Berkeley Packet Filter) solves.
This article is not an introduction to eBPF. It assumes you understand its core concepts: sandboxed, event-driven programs that run in a kernel VM, and the role of the verifier in ensuring safety. Instead, we will focus on architecting and implementing a production-grade, eBPF-based security monitor for Kubernetes from scratch.
Architecting Our eBPF Security Monitor
Our goal is to detect suspicious activity within our pods, such as unexpected file access or process execution. We'll build a system with three core components: eBPF programs attached to syscall tracepoints in the kernel, a userspace Go agent (running as a DaemonSet) that loads those programs and consumes their events, and a Kubernetes enrichment layer that maps cgroup IDs to pod metadata.
Here is a high-level view of the architecture on a single Kubernetes node:
graph TD
subgraph Kubernetes Node
subgraph Kernel Space
A[Syscall: openat, execve, etc.] -- Triggers --> B{eBPF Programs}
B -- Writes Events --> C[eBPF Ring Buffer Map]
B -- Reads/Writes State --> D[eBPF Hash/LPM Maps]
end
subgraph User Space
E["Userspace Agent (Go DaemonSet)"] -- Loads/Attaches --> B
E -- Reads Events --> C
E -- Manages State --> D
E -- Queries --> F[Kubelet API]
E -- Queries --> G[Kubernetes API Server]
E -- Forwards Alerts --> H[SIEM / Logging Backend]
end
P1[Pod 1] --> A
P2[Pod 2] --> A
end
Our agent will run with sufficient privileges (`CAP_SYS_ADMIN`, `CAP_BPF`) to load eBPF programs. It will use the Kubernetes API to watch for pod lifecycle events on its node, building a local cache that maps container identifiers (like cgroup IDs) to rich metadata (pod name, namespace, labels, etc.). When an eBPF program detects an event, it will push the raw data into a high-performance ring buffer. The userspace agent reads from this buffer, uses the cgroup ID from the event to look up the Kubernetes context in its cache, and then makes a policy decision.
Part 1: The Kernel Probes for Syscall Tracing
The core of our detection logic resides in the eBPF programs. We'll use the `libbpf` C library, which facilitates the modern CO-RE (Compile Once – Run Everywhere) approach. This avoids the painful process of compiling our eBPF code for every specific kernel version our nodes might be running.
Let's start by tracing the `execve` syscall to monitor all new processes being executed inside our pods.
Handling `execve`: The Challenge of Syscall Arguments
Tracing `execve` is more complex than it sounds. The syscall signature is `int execve(const char *pathname, char *const argv[], char *const envp[]);`. When our eBPF program is triggered at the syscall's entry point (`sys_enter_execve`), the arguments (`pathname`, `argv`) are pointers to userspace memory. eBPF programs run in kernel space and cannot directly dereference arbitrary userspace pointers for safety reasons. We must use a specific helper function, `bpf_probe_read_user_str()`, to safely copy the data.
Here is our initial eBPF program (`bpf/monitor.c`):
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#define MAX_FILENAME_LEN 256
#define MAX_ARGS 16
#define MAX_ARG_LEN 128
// Event structure sent to userspace
struct exec_event {
u64 cgroup_id;
u32 pid;
u32 ppid;
char filename[MAX_FILENAME_LEN];
char comm[TASK_COMM_LEN];
u8 args_count;
char args[MAX_ARGS][MAX_ARG_LEN];
};
// Ring buffer for sending events to userspace
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Optional: force Clang to emit BTF info for our custom struct
const struct exec_event *unused __attribute__((unused));
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx) {
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32;
struct task_struct *task = (struct task_struct *)bpf_get_current_task();
u32 ppid = BPF_CORE_READ(task, real_parent, tgid);
struct exec_event *event;
event = bpf_ringbuf_reserve(&rb, sizeof(*event), 0);
if (!event) {
return 0; // Ring buffer is full, drop event
}
event->pid = pid;
event->ppid = ppid;
event->cgroup_id = bpf_get_current_cgroup_id();
bpf_get_current_comm(&event->comm, sizeof(event->comm));
// Safely read filename from userspace pointer
    const char *filename_ptr = (const char *)ctx->args[0];
bpf_probe_read_user_str(&event->filename, sizeof(event->filename), filename_ptr);
// Safely read argv from userspace
    const char *const *argv_ptr = (const char *const *)ctx->args[1];
event->args_count = 0;
#pragma unroll
for (int i = 0; i < MAX_ARGS; i++) {
const char __user* argp = NULL;
bpf_probe_read_user(&argp, sizeof(argp), &argv_ptr[i]);
if (!argp) {
break;
}
bpf_probe_read_user_str(&event->args[i], MAX_ARG_LEN, argp);
event->args_count++;
}
bpf_ringbuf_submit(event, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Key Advanced Concepts in this Code:
* `vmlinux.h` and CO-RE: We include `vmlinux.h`, a header file generated from kernel debugging information (BTF). This allows `libbpf` to understand kernel data structures (`task_struct`) and perform relocations at load time, making our eBPF object portable across different kernel versions.
* `BPF_CORE_READ`: This macro is used for safely accessing fields of kernel structs. It's part of the CO-RE mechanism. For example, `BPF_CORE_READ(task, real_parent, tgid)` correctly reads the parent process ID regardless of how the `task_struct` layout changes between kernels.
* `BPF_MAP_TYPE_RINGBUF`: We use a ring buffer map, the modern and most performant way to send data from kernel to userspace. It's a multi-producer, single-consumer (MPSC) lock-free buffer that avoids the overhead and potential for lost events seen in the older `BPF_MAP_TYPE_PERF_EVENT_ARRAY` mechanism.
* `bpf_get_current_cgroup_id()`: This is the critical piece of the puzzle for Kubernetes context. It retrieves the cgroup v2 ID of the cgroup the current process belongs to. Since Kubernetes uses cgroups to isolate containers, this ID is our primary key for mapping a kernel event back to a specific pod and container.
* `#pragma unroll`: The eBPF verifier requires loops to have provably bounded execution. By using `#pragma unroll`, we tell the compiler to unroll the loop into a fixed sequence of instructions that the verifier can analyze and approve.
Part 2: The Go Userspace Agent
Now we need a Go program to load and interact with our eBPF code. We'll use the excellent `cilium/ebpf` library.
The agent's responsibilities are:
- Generate eBPF object files from our C code.
- Load the eBPF object file into the kernel.
- Attach the eBPF program to the specified tracepoint.
- Open the ring buffer map and start polling for events.
- (Crucially) Connect to the Kubernetes API to build a cgroup-to-pod metadata map.
- Process events: enrich them with metadata and log them.
Building and Loading the eBPF Program
First, we need a way to compile our C code into an eBPF object file. We'll use `clang` and `llvm-strip`. A `Makefile` helps automate this:
CLANG ?= clang
LLVM_STRIP ?= llvm-strip
# clang's BPF target triple (use bpfel/bpfeb to pin endianness explicitly)
BPF_TARGET_ARCH ?= bpf
VMLINUX_H = vmlinux.h
all: monitor.bpf.o
# Generate vmlinux.h from BTF info
$(VMLINUX_H):
	bpftool btf dump file /sys/kernel/btf/vmlinux format c > $(VMLINUX_H)
# Compile C code to eBPF object file
monitor.bpf.o: bpf/monitor.c $(VMLINUX_H)
	$(CLANG) \
	-g -O2 -target $(BPF_TARGET_ARCH) \
	-I. \
	-c bpf/monitor.c \
	-o $@
	$(LLVM_STRIP) -g $@
clean:
	rm -f monitor.bpf.o $(VMLINUX_H)
We can embed the compiled object file directly into our Go binary using `go:embed`.
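A minimal sketch of that manual approach, assuming the object file produced by the Makefile sits next to the Go source (in the rest of this article we let `bpf2go` handle embedding and loading for us):
package main

import (
	"bytes"
	_ "embed"

	"github.com/cilium/ebpf"
)

//go:embed monitor.bpf.o
var bpfObjectFile []byte

// loadCollection parses the embedded ELF object and loads its programs and
// maps into the kernel.
func loadCollection() (*ebpf.Collection, error) {
	spec, err := ebpf.LoadCollectionSpecFromReader(bytes.NewReader(bpfObjectFile))
	if err != nil {
		return nil, err
	}
	return ebpf.NewCollection(spec)
}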
The Go Agent Code (`main.go`)
Here's a simplified but functional version of the agent's core logic.
package main
import (
	"bytes"
	"context"
	"encoding/binary"
	"errors"
	"log"
	"os"
	"os/signal"
	"syscall"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
	"golang.org/x/sys/unix"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target bpfel -type exec_event bpf ./bpf/monitor.c -- -I./
const (
TASK_COMM_LEN = 16
MAX_FILENAME_LEN = 256
MAX_ARGS = 16
MAX_ARG_LEN = 128
)
func main() {
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// Load pre-compiled programs and maps into the kernel.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach the tracepoint program.
tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
if err != nil {
log.Fatalf("attaching tracepoint: %v", err)
}
defer tp.Close()
log.Println("eBPF programs loaded and attached. Waiting for events...")
// Open a ringbuf reader from the eBPF map.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %s", err)
}
defer rd.Close()
// Goroutine to handle closing the reader on interrupt.
go func() {
<-ctx.Done()
rd.Close()
}()
var event bpfExecEvent
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, ringbuf.ErrClosed) {
log.Println("Received signal, exiting...")
return
}
log.Printf("error reading from ringbuf: %s", err)
continue
}
// Parse the event data.
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing ringbuf event: %s", err)
continue
}
// *** This is where enrichment would happen ***
// podMeta := metadataCache.Get(event.CgroupId)
log.Printf("PID: %d, PPID: %d, Cgroup: %d, Comm: %s, Filename: %s",
event.Pid, event.Ppid, event.CgroupId, unix.ByteSliceToString(event.Comm[:]), unix.ByteSliceToString(event.Filename[:]))
}
}
Important Details:
* `bpf2go`: This tool from the `cilium/ebpf` project is fantastic. It takes our C eBPF code, compiles it, and generates Go code that handles loading the object file and provides typed Go structs that match our C structs. This eliminates a ton of boilerplate.
* `link.Tracepoint`: This is the modern way to attach eBPF programs, replacing older `ioctl`-based methods. It returns a `link.Link` object, and closing this link automatically detaches the program.
* `ringbuf.NewReader`: We create a reader for our ring buffer map. The `Read()` call blocks until a new event is available.
Part 3: The Missing Piece - Kubernetes Context Enrichment
Receiving a `cgroup_id` is great, but it's just a number. To make our security events actionable, we must map this ID to a Pod, Namespace, and Deployment.
This is a non-trivial engineering problem in itself. The general pattern is:
- The agent uses an in-cluster `rest.Config` to talk to the API server.
- It sets up a `WATCH` for pod events on the node it's running on. We can get the node name from the `NODE_NAME` environment variable (injected via the Downward API in the DaemonSet manifest).
- From each pod object, it extracts `metadata` (name, namespace, labels) and `status.containerStatuses`. The `containerID` field in the container status (e.g., `containerd://...`) is key.
- Given a pod UID `<pod_uid>` and container ID `<container_id>`, the cgroup path might look something like `/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<pod_uid>.slice/cri-containerd-<container_id>.scope`.
- The agent resolves the numeric cgroup ID for that path (e.g., from the inode of the cgroup directory under `/sys/fs/cgroup/...`) and stores a mapping `cgroup_id -> PodMetadata` in a concurrent-safe map.
When our eBPF event loop receives an event with `CgroupId: 12345`, it does a quick lookup in this local cache. Now, the log message can be transformed from `cgroup 12345 executed /bin/bash` to `ALERT: Shell spawned in pod 'billing-api-7f...' (namespace: prod, deployment: billing-api)`. This is an actionable security signal.
Here's a conceptual snippet of the enrichment logic:
// In your agent's main struct
type Agent struct {
// ... other fields
metadataCache *sync.Map // In a real implementation, use a more structured, TTL-based cache
}
// Goroutine that watches the K8s API
func (a *Agent) watchPods(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) {
// ... setup watcher for pods on `nodeName` ...
for event := range watcher.ResultChan() {
pod, ok := event.Object.(*v1.Pod)
if !ok { continue }
switch event.Type {
case watch.Added, watch.Modified:
// For each container in the pod, get its cgroup ID and update the cache
cgroupId := getCgroupIdForPod(pod) // This is a complex helper function!
a.metadataCache.Store(cgroupId, buildPodMetadata(pod))
case watch.Deleted:
cgroupId := getCgroupIdForPod(pod)
a.metadataCache.Delete(cgroupId)
}
}
}
// In the event processing loop
func (a *Agent) processEvent(event bpfExecEvent) {
	meta := "(unknown context)"
	if v, ok := a.metadataCache.Load(event.CgroupId); ok {
		// sync.Map stores interface{} values, so assert back to our metadata type
		podMeta := v.(*PodMetadata)
		meta = fmt.Sprintf("Pod: %s, Namespace: %s", podMeta.Name, podMeta.Namespace)
	}
	log.Printf("EXEC EVENT [%s]: %s", meta, unix.ByteSliceToString(event.Filename[:]))
}
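The `PodMetadata` type and the cgroup-ID resolution above are left abstract, so here is a hedged sketch of plausible building blocks. The struct shape is an assumption, and constructing the cgroup directory path for a given pod and container is deliberately left to the caller (it varies by cgroup driver and container runtime). The useful invariant is the one the list above relies on: on cgroup v2, the ID returned by `bpf_get_current_cgroup_id()` is the inode number of the cgroup directory, so a `stat()` resolves it:
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
	v1 "k8s.io/api/core/v1"
)

// PodMetadata is the value stored in the agent's cgroup-ID-keyed cache.
type PodMetadata struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// buildPodMetadata extracts the fields we care about from a Pod object.
func buildPodMetadata(pod *v1.Pod) *PodMetadata {
	return &PodMetadata{
		Name:      pod.Name,
		Namespace: pod.Namespace,
		Labels:    pod.Labels,
	}
}

// cgroupIDForPath resolves the numeric cgroup v2 ID for a cgroup directory by
// reading its inode number, which matches the value our eBPF program reports
// via bpf_get_current_cgroup_id().
func cgroupIDForPath(cgroupPath string) (uint64, error) {
	var st unix.Stat_t
	if err := unix.Stat(cgroupPath, &st); err != nil {
		return 0, fmt.Errorf("stat %s: %w", cgroupPath, err)
	}
	return st.Ino, nil
}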
Production Considerations, Edge Cases, and Performance
Building a toy is one thing; running it in production is another. Here are the critical considerations for a senior engineering team.
Edge Case: Containerd Shim Processes
When you `kubectl exec` into a container, you might not see the `execve` event from your target process directly. Instead, you might see containerd's shim process (`containerd-shim-runc-v2`) executing the command. The process you're interested in will be a child of this shim. This means robust detection requires process ancestry tracking. You'd need to trace `fork` and `exec` calls and maintain a process tree in an eBPF map (a `BPF_MAP_TYPE_HASH` mapping `pid -> parent_pid/metadata`) to correctly attribute the final executed binary to the originating pod context.
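A hedged sketch of what such a process-tree map could look like on the eBPF side (the map name and value layout are illustrative, not part of the program above):
// Value stored per PID: enough to walk back to the originating pod context.
struct proc_info {
    u32 ppid;
    u64 cgroup_id;
};

// Process tree keyed by PID. Entries would be inserted from fork/exec
// tracepoints (e.g. sched_process_fork) and deleted on process exit so the
// map doesn't grow without bound.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, u32);
    __type(value, struct proc_info);
} proc_tree SEC(".maps");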
Performance: Kernel Filtering is Key
Sending every single `execve` event from the kernel to userspace is inefficient and noisy. On a busy system, this could be thousands of events per second. The power of eBPF is the ability to filter in the kernel.
Example: Let's only send events for processes executed out of suspicious directories like `/tmp` or `/var/tmp`.
We can't do string comparison directly in eBPF easily. A common pattern is to send the event to userspace and let it do the filtering. A more advanced, but highly efficient, pattern is to use a set of eBPF maps (see the sketch after this list):
- Create a `BPF_MAP_TYPE_HASH` map where the key is a `char[256]` (the filename) and the value is a `u8` (boolean flag).
- The userspace agent populates this map with an allowlist of known-good binary paths (e.g., `/usr/bin/python3`, `/app/server`).
- The eBPF program performs a `bpf_map_lookup_elem()` on this map. If the filename is not found in the allowlist, then and only then does it reserve space in the ring buffer and send the event.
This reduces kernel-userspace traffic by orders of magnitude, dramatically lowering the agent's CPU overhead.
High Event Volume and Ring Buffer Overflows
Even with filtering, a security incident (like a fork bomb) could generate a massive burst of events, potentially overflowing the ring buffer. The `bpf_ringbuf_reserve` call will start returning `NULL`, and we will lose events.
This is a fundamental trade-off in observability. You can increase the ring buffer size, but it consumes non-swappable kernel memory. Better strategies include:
* In-kernel Aggregation: For some event types (e.g., network connections), you can aggregate stats in an eBPF map. For instance, map `(source_ip, dest_ip) -> count` and only send an update to userspace periodically or when a threshold is breached.
* Userspace Backpressure: The userspace agent should be designed to consume events as fast as possible. If its downstream sink (e.g., a SIEM) is slow, the agent should buffer events in userspace memory or even drop less critical events to keep up with the kernel.
* Monitoring: The agent must monitor for dropped events. The `cilium/ebpf/ringbuf` reader doesn't directly expose this, but in a `perf_event_array` setup, you can read a counter of lost samples. For ring buffers, you'd need to implement a health check in your agent to see if it's falling behind.
Deploying to Kubernetes
Our agent must be deployed as a DaemonSet to ensure it runs on every node. The manifest requires specific security contexts and volume mounts to access kernel resources.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-security-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ebpf-security-monitor
  template:
    metadata:
      labels:
        name: ebpf-security-monitor
    spec:
      tolerations:
      - operator: Exists
      hostPID: true # Required for process ancestry tracking
      containers:
      - name: monitor-agent
        image: your-registry/ebpf-monitor-agent:latest
        securityContext:
          privileged: true # Simplest way, but can be locked down
          # For a non-privileged setup, you need:
          # capabilities:
          #   add:
          #     - SYS_ADMIN
          #     - BPF
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: bpf-fs
          mountPath: /sys/fs/bpf
          mountPropagation: HostToContainer
        - name: debug-fs
          mountPath: /sys/kernel/debug
      volumes:
      - name: bpf-fs
        hostPath:
          path: /sys/fs/bpf
      - name: debug-fs
        hostPath:
          path: /sys/kernel/debug
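Because the agent watches pods via the API server, the DaemonSet also needs a ServiceAccount with read access to pods, referenced from the pod spec via `serviceAccountName`. A minimal RBAC sketch (names are placeholders):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebpf-security-monitor
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ebpf-security-monitor
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ebpf-security-monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ebpf-security-monitor
subjects:
- kind: ServiceAccount
  name: ebpf-security-monitor
  namespace: kube-system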
Conclusion: The Future is Kernel-Level
We have journeyed from the high-level problem of runtime security in Kubernetes down to the intricate details of kernel-level syscall tracing, memory management, and context enrichment. By leveraging eBPF with a CO-RE approach, we've built the foundation of a security tool that is highly performant, difficult to evade, and deeply context-aware.
This approach is not a silver bullet, but it represents a paradigm shift. It moves security monitoring from brittle application-level shims or resource-intensive sidecars to a centralized, efficient, and programmable layer within the kernel itself. This is the same technology that powers modern cloud-native networking (Cilium), service meshes (Istio's eBPF datapath), and observability (Pixie, Parca).
For senior engineers tasked with securing complex, large-scale systems, mastering eBPF is no longer optional. It is the key to unlocking the next generation of secure, observable, and high-performance cloud-native infrastructure.