eBPF for Real-Time K8s Pod Threat Detection via Syscall Hooks
The Observability Gap: Why Sidecars Fall Short for Runtime Security
In modern Kubernetes environments, runtime security is a non-negotiable requirement. The standard approach has often involved injecting a security agent as a sidecar container into each pod or deploying a privileged agent on each node. Both models present significant trade-offs that senior engineers must contend with.
Sidecars, while providing application-level context, introduce resource overhead (CPU/memory per pod), increase latency for network-based policies, and complicate the deployment lifecycle. Node-level agents, while more efficient, often rely on higher-level container runtime interfaces or periodic process table scans, creating a temporal gap where a malicious, short-lived process can execute and terminate between scans, completely evading detection.
Both approaches struggle with the kernel-userspace boundary. Intrusive methods like ptrace impose a heavy performance penalty, while less intrusive methods lack the fidelity to capture every critical system call. This is the core problem: we need a mechanism that offers comprehensive, kernel-level visibility with minimal performance overhead, tailored for the ephemeral and dynamic nature of Kubernetes pods. This is precisely the niche where eBPF (extended Berkeley Packet Filter) excels.
This article assumes you understand the fundamentals of eBPF—what it is, the role of the verifier, and the concept of maps. We will dive directly into building a sophisticated pod security monitoring tool that hooks directly into system calls to detect threats in real-time.
Core Implementation: Hooking `execve` with a `kprobe`
Our first objective is to detect every new process execution within any pod on a node. The canonical system call for this is execve. We'll use a kprobe (kernel probe) to dynamically attach our eBPF program to the entry point of the kernel function that handles this syscall.
1. The eBPF Kernel-Space Program (C)
This C code is not meant to be compiled with a standard C compiler like GCC. It's compiled using Clang/LLVM into eBPF bytecode. We'll use libbpf conventions and helpers.
pod_monitor.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
// Define a common structure for events to be sent to user space.
// This ensures a consistent data format.
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 256
struct event {
__u32 pid;
__u32 ppid;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// Ring buffer for sending events to user space.
// Ringbuf is generally preferred over perf buffers for high-throughput scenarios:
// it is a single buffer shared across all CPUs (more memory efficient), it preserves
// event ordering, and the reserve API lets the producer detect a full buffer up front.
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// This is the kprobe attached to the entry of the execve syscall.
// The function name format `__x64_sys_execve` is specific to the x86_64 architecture.
// Using tracepoints like `syscalls/sys_enter_execve` is often more stable across kernel versions,
// but we'll start with kprobe for demonstration.
SEC("kprobe/__x64_sys_execve")
int BPF_KPROBE(handle_execve, const struct pt_regs *regs)
{
struct event *e;
struct task_struct *task;
__u64 id;
__u32 pid, ppid;
const char *filename_ptr;
// Get PID and Parent PID
id = bpf_get_current_pid_tgid();
pid = id >> 32;
// Reserve space on the ring buffer for our event
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0; // Not enough space, drop event
}
// Get task_struct for more info
task = (struct task_struct *)bpf_get_current_task();
ppid = BPF_CORE_READ(task, real_parent, tgid);
// Populate our event structure
e->pid = pid;
e->ppid = ppid;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
// With the syscall wrapper convention (__x64_sys_execve), `regs` points to the
// user-space register set saved in kernel memory, so read the first argument
// with a CO-RE-aware accessor instead of dereferencing the pointer directly.
filename_ptr = (const char *)PT_REGS_PARM1_CORE(regs);
// The filename itself lives in user-space memory, so copy it with a helper.
bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
// Submit the event to user space
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Key Implementation Details:
* vmlinux.h: This header is generated by bpftool and contains kernel type definitions. It's essential for CO-RE (Compile Once - Run Everywhere), allowing our program to work across different kernel versions without recompilation.
* BPF_MAP_TYPE_RINGBUF: We're using a ring buffer map. Unlike perf buffers, which allocate one buffer per CPU, the ring buffer is a single multi-producer, single-consumer lockless ring shared across all CPUs. It preserves event ordering and lets producers detect a full buffer at reserve time, making it ideal for high-volume syscall monitoring.
* BPF_CORE_READ: This macro is a CO-RE helper that safely reads kernel struct fields. If the kernel struct layout changes in a future version, the eBPF loader can perform runtime relocations to ensure the correct field is accessed.
* bpf_probe_read_user_str: Syscall arguments such as the filename point into user-space memory, which eBPF programs cannot dereference directly. This helper copies the string safely into the event buffer we reserved, truncating it to the destination size.
2. The User-Space Controller (Go)
This Go program is responsible for loading the eBPF bytecode into the kernel, attaching the probe, and listening for events from the ring buffer.
We'll use the excellent cilium/ebpf library.
main.go
package main
import (
"bytes"
"encoding/binary"
"log"
"os"
"os/signal"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/ringbuf"
"golang.org/x/sys/unix"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf pod_monitor.bpf.c -- -I./headers
const (
taskCommLen = 16
maxFilenameLen = 256
)
type event struct {
Pid uint32
Ppid uint32
Comm [taskCommLen]byte
Filename [maxFilenameLen]byte
}
func main() {
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
// Allow the current process to lock memory for eBPF maps.
if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}); err != nil {
log.Fatalf("failed to set rlimit: %v", err)
}
// Load pre-compiled programs and maps into the kernel.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach the kprobe.
kp, err := link.Kprobe("__x64_sys_execve", objs.HandleExecve, nil)
if err != nil {
log.Fatalf("attaching kprobe: %v", err)
}
defer kp.Close()
log.Println("eBPF program attached. Waiting for events... Press Ctrl+C to exit.")
// Open a ring buffer reader from user space.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %v", err)
}
defer rd.Close()
// Goroutine to handle process termination
go func() {
<-stopper
log.Println("Received signal, exiting...")
rd.Close()
}()
var ev event
for {
record, err := rd.Read()
if err != nil {
// Ringbuf is closed, probably due to program termination.
if err == ringbuf.ErrClosed {
return
}
log.Printf("reading from reader: %s", err)
continue
}
// Parse the raw data into our event struct.
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &ev); err != nil {
log.Printf("parsing ringbuf event: %s", err)
continue
}
log.Printf("PID: %d, PPID: %d, Command: %s, Filename: %s",
ev.Pid,
ev.Ppid,
unix.ByteSliceToString(ev.Comm[:]),
unix.ByteSliceToString(ev.Filename[:]),
)
}
}
To run this:
1. Ensure clang, llvm, and libbpf-dev are installed.
2. Generate the kernel type header: bpftool btf dump file /sys/kernel/btf/vmlinux format c > headers/vmlinux.h.
3. Run go generate to compile the C code and embed it into a Go file.
4. Run sudo go run . (root privileges are required to load eBPF programs).
Now, if you execute any command in another terminal (e.g., ls /tmp), you will see the corresponding event logged by our Go program.
Production Pattern: Correlating eBPF Events with Kubernetes Metadata
An event with a PID is useless in a Kubernetes context. We need to know which pod, namespace, and deployment this process belongs to. Simply querying the Docker/containerd daemon for every PID is inefficient and prone to race conditions. The robust solution is to build an in-memory cache by watching the Kubernetes API server.
Our user-space controller will now have two primary functions:
- Watch the Kubernetes API and maintain an in-memory ContainerID -> PodMetadata cache.
- Process eBPF events and enrich them using this cache.
We need a way to link a PID from an eBPF event to a container ID. The most reliable way is via the Cgroup ID. We'll modify our eBPF program to capture the Cgroup ID for each event.
1. Updated eBPF Program with Cgroup ID
pod_monitor.bpf.c (updated handle_execve)
// Add cgroup_id to the event struct
struct event {
__u64 cgroup_id;
__u32 pid;
__u32 ppid;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// ... (rest of the file is the same)
SEC("kprobe/__x64_sys_execve")
int BPF_KPROBE(handle_execve, const struct pt_regs *regs)
{
// ... (variable declarations and bpf_ringbuf_reserve as before)
// Get Cgroup ID
e->cgroup_id = bpf_get_current_cgroup_id();
// ... (rest of the function is the same)
bpf_ringbuf_submit(e, 0);
return 0;
}
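On the Go side, the event struct in main.go needs a matching CgroupID field so that binary.Read keeps decoding records correctly. A minimal update, mirroring the field order and sizes of the C struct above:
// Mirrors the updated C struct: cgroup_id first, then pid/ppid, comm, filename.
type event struct {
	CgroupID uint64
	Pid      uint32
	Ppid     uint32
	Comm     [taskCommLen]byte
	Filename [maxFilenameLen]byte
}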
2. Enhanced Go Controller with K8s API Watcher
We will use the official client-go library to interact with the Kubernetes API.
k8s_enricher.go
package main
import (
"context"
"log"
"path/filepath"
"regexp"
"sync"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/util/homedir"
)
type PodMetadata struct {
Name string
Namespace string
NodeName string
}
// CgroupManager is a thread-safe cache for mapping cgroup IDs to pod metadata.
type CgroupManager struct {
mu sync.RWMutex
cache map[uint64]PodMetadata
}
func NewCgroupManager() *CgroupManager {
return &CgroupManager{
cache: make(map[uint64]PodMetadata),
}
}
func (cm *CgroupManager) Get(cgroupID uint64) (PodMetadata, bool) {
cm.mu.RLock()
defer cm.mu.RUnlock()
meta, found := cm.cache[cgroupID]
return meta, found
}
// The core challenge: parsing the cgroup path to find the container ID,
// then mapping that to a cgroup inode ID.
// This is highly dependent on the container runtime's cgroup driver (systemd vs cgroupfs).
// For cgroupfs on Docker/containerd, paths look like:
// /kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>
var podCgroupRegex = regexp.MustCompile(`pod([a-f0-9\-]+)`)
func (cm *CgroupManager) StartInformer(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) {
factory := informers.NewSharedInformerFactoryWithOptions(clientset, 0, informers.WithTweakListOptions(func(options *metav1.ListOptions) {
options.FieldSelector = "spec.nodeName=" + nodeName
}))
podInformer := factory.Core().V1().Pods().Informer()
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
// Logic to add pod and its containers to the cache
// You would read the cgroup ID from /sys/fs/cgroup/... for each container
// This is a complex part involving filesystem interaction and is omitted here
// for brevity; a sketch of the cgroup-inode lookup appears in the "Edge Case"
// section below. The key is to map the container's cgroup inode number
// to the pod metadata.
},
DeleteFunc: func(obj interface{}) {
// Logic to remove pod and its containers from the cache
},
})
log.Println("Starting Kubernetes informer...")
factory.Start(ctx.Done())
factory.WaitForCacheSync(ctx.Done())
log.Println("Kubernetes informer synced.")
}
// main function would be updated to initialize this
func main_with_k8s() { // conceptual
nodeName := os.Getenv("NODE_NAME")
if nodeName == "" {
log.Fatal("NODE_NAME environment variable not set")
}
// K8s client setup
config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
if err != nil {
log.Fatal("failed to build kubeconfig")
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatal("failed to create clientset")
}
cgroupManager := NewCgroupManager()
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go cgroupManager.StartInformer(ctx, clientset, nodeName)
// ... eBPF loading and event loop from previous example ...
// Inside the event loop:
// var ev event (with CgroupID)
// ... read event from ring buffer ...
if meta, found := cgroupManager.Get(ev.CgroupID); found {
log.Printf("Pod: %s/%s, PID: %d, Command: %s, Filename: %s",
meta.Namespace, meta.Name, ev.Pid, /* command */, /* filename */)
} else {
// Event from a process not in a tracked pod (e.g., host process)
log.Printf("Host Process Event: PID: %d, ...", ev.Pid)
}
}
Edge Case: The Cgroup ID Mapping Problem
This is the hardest part of the implementation. The bpf_get_current_cgroup_id() helper returns the inode number of the cgroup directory the process belongs to. The user-space controller must:
- Watch for new pods on its node.
- For each container in those pods, locate its cgroup directory under /sys/fs/cgroup/ and read its inode number via the stat syscall.
- Maintain a map of inode_number -> PodMetadata.
This is complex because the cgroup path structure depends on the CRI (containerd, CRI-O) and the cgroup driver (systemd vs. cgroupfs). A production-ready implementation needs to handle these variations. It also needs to handle container restarts within a pod, which might get a new cgroup but belong to the same pod. A minimal sketch of the inode lookup follows.
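To make the omitted part of AddFunc concrete, here is a heavily hedged sketch. It assumes a cgroup v2 unified hierarchy with the cgroupfs driver and the path layout shown in the earlier comment; the function names (cgroupInode, containerCgroupDir) and the qosClass parameter are illustrative, not part of any library.
package main

import (
	"fmt"
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

// cgroupInode returns the inode number of a cgroup directory. On a cgroup v2
// hierarchy this is the same value bpf_get_current_cgroup_id() reports for
// processes running inside that cgroup.
func cgroupInode(cgroupDir string) (uint64, error) {
	var st unix.Stat_t
	if err := unix.Stat(cgroupDir, &st); err != nil {
		return 0, fmt.Errorf("stat %s: %w", cgroupDir, err)
	}
	return st.Ino, nil
}

// containerCgroupDir guesses a container's cgroup directory, assuming the
// cgroupfs driver layout shown earlier (/kubepods/<qos>/pod<POD_UID>/<CONTAINER_ID>).
// Guaranteed pods sit directly under kubepods (no QoS sub-directory), and
// systemd-driver clusters use .slice/.scope names instead, so a production
// implementation must detect the driver or scan /sys/fs/cgroup for a directory
// whose name contains the container ID.
func containerCgroupDir(qosClass, podUID, containerID string) string {
	return filepath.Join("/sys/fs/cgroup", "kubepods",
		strings.ToLower(qosClass), "pod"+podUID, containerID)
}
Inside the informer's AddFunc you would resolve each container's directory, call cgroupInode, and store the result in the CgroupManager cache; DeleteFunc removes those entries again.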
Advanced Use Case: Detecting Malicious File Access with `tracepoints`
Monitoring process execution is good, but monitoring file access is better for detecting threats like reading sensitive files (/etc/shadow) or writing to unexpected locations. For this, tracepoints are superior to kprobes. They are stable API points in the kernel, meaning they won't break with minor kernel updates.
We'll hook the sys_enter_openat tracepoint.
pod_monitor.bpf.c (additional program)
// ... existing event struct and ring buffer map ...
// Add a new event type to distinguish between exec and open
enum event_type {
EVENT_EXEC,
EVENT_OPEN,
};
struct event {
__u64 cgroup_id;
__u32 pid;
enum event_type type;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// Tracepoint for sys_enter_openat
// The context `struct trace_event_raw_sys_enter* ctx` is provided by the kernel
// and its structure is defined in vmlinux.h.
SEC("tracepoint/syscalls/sys_enter_openat")
int handle_openat(struct trace_event_raw_sys_enter* ctx)
{
struct event *e;
const char *filename_ptr;
// Reserve space
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
// Populate common fields
e->cgroup_id = bpf_get_current_cgroup_id();
e->pid = bpf_get_current_pid_tgid() >> 32;
e->type = EVENT_OPEN;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
// The filename is the second argument to openat.
// We access it via the tracepoint context.
filename_ptr = (const char *)ctx->args[1];
bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
// Submit event
bpf_ringbuf_submit(e, 0);
return 0;
}
Your Go application would now need to:
- Attach the handle_openat program to its tracepoint.
- Update the Go event struct to include the type field.
- In the event processing loop, switch on the event type to handle exec and open events differently (a minimal sketch follows).
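A minimal sketch of these changes in main.go, assuming bpf2go exposes the new program as objs.HandleOpenat and that the Go event struct has gained a Type field (a uint32, mirroring the 4-byte C enum; the constant names below are illustrative):
// Event types mirroring the order of the C enum (illustrative names).
const (
	eventExec uint32 = iota // EVENT_EXEC
	eventOpen               // EVENT_OPEN
)

// In main(), after attaching the execve kprobe:
tp, err := link.Tracepoint("syscalls", "sys_enter_openat", objs.HandleOpenat, nil)
if err != nil {
	log.Fatalf("attaching tracepoint: %v", err)
}
defer tp.Close()

// In the event read loop, after decoding the record into ev:
switch ev.Type {
case eventExec:
	log.Printf("exec: pid=%d comm=%s file=%s", ev.Pid,
		unix.ByteSliceToString(ev.Comm[:]), unix.ByteSliceToString(ev.Filename[:]))
case eventOpen:
	log.Printf("open: pid=%d comm=%s file=%s", ev.Pid,
		unix.ByteSliceToString(ev.Comm[:]), unix.ByteSliceToString(ev.Filename[:]))
}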
With pod metadata attached, detection rules can be expressed at a useful level of abstraction, for example: IF event.type == EVENT_OPEN AND pod.name CONTAINS 'nginx' AND event.filename == '/etc/shadow' THEN ALERT.
Performance Considerations & Production Hardening
Running eBPF programs on every syscall can have performance implications if not handled carefully.
* Filter in the kernel, not in user space: Your user-space program can create an eBPF map (e.g., a hash map) and populate it with the cgroup IDs of all containers it's monitoring. The eBPF program then checks whether the current process's cgroup ID exists in this map before sending an event, so uninteresting events never cross the kernel boundary (the user-space half of this filter is sketched after the snippet below).
* eBPF Code Snippet:
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 8192);
__type(key, __u64);
__type(value, __u8);
} monitored_cgroups SEC(".maps");
// In your kprobe/tracepoint:
__u64 cgroup_id = bpf_get_current_cgroup_id();
__u8 *is_monitored = bpf_map_lookup_elem(&monitored_cgroups, &cgroup_id);
if (!is_monitored) {
return 0; // Not a cgroup we care about, drop the event.
}
// ... proceed to send event
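The user-space half of this filter is small. Assuming bpf2go exposes the map above as objs.MonitoredCgroups, the informer handlers can keep it in sync with the pods on the node:
// Start monitoring a cgroup: the eBPF program will now emit events for it.
func addMonitoredCgroup(objs *bpfObjects, cgroupID uint64) error {
	var one uint8 = 1
	return objs.MonitoredCgroups.Put(cgroupID, one)
}

// Stop monitoring a cgroup (e.g., when its pod is deleted from the node).
func removeMonitoredCgroup(objs *bpfObjects, cgroupID uint64) error {
	return objs.MonitoredCgroups.Delete(cgroupID)
}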
* Choose the right map type: BPF_MAP_TYPE_RINGBUF is generally superior for high-volume event sources. It is a single lockless buffer shared across all CPUs (rather than one buffer per CPU, as with perf buffers), it preserves event ordering, its reserve/submit API lets the producer detect a full buffer instead of silently losing data, and it avoids the expensive per-event wakeup notifications that perf buffers require.
* Rely on CO-RE: compile against the generated vmlinux.h and let libbpf and the cilium/ebpf library perform runtime relocations, making your program portable across a fleet of machines with varying kernel versions.
Deployment as a Kubernetes DaemonSet
This tool is designed to run on every node in the cluster. The natural Kubernetes construct for this is a DaemonSet.
daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: ebpf-pod-monitor
namespace: kube-system
labels:
app: ebpf-pod-monitor
spec:
selector:
matchLabels:
app: ebpf-pod-monitor
template:
metadata:
labels:
app: ebpf-pod-monitor
spec:
hostPID: true # Share the host PID namespace so the agent can inspect host processes via /proc
tolerations:
- operator: Exists
serviceAccountName: ebpf-monitor-sa
containers:
- name: monitor
image: your-repo/ebpf-pod-monitor:latest
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
securityContext:
# This is the most critical and dangerous part.
# CAP_BPF and CAP_PERFMON are needed to load and manage eBPF programs.
# CAP_SYS_ADMIN is a broad capability often required for certain helpers.
# Running as privileged is the easiest way but least secure.
privileged: true
volumeMounts:
- name: bpf-fs
mountPath: /sys/fs/bpf
readOnly: false
volumes:
- name: bpf-fs
hostPath:
path: /sys/fs/bpf
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: ebpf-monitor-sa
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: ebpf-monitor-role
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ebpf-monitor-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ebpf-monitor-role
subjects:
- kind: ServiceAccount
name: ebpf-monitor-sa
namespace: kube-system
Security Implications:
Loading eBPF programs requires significant privileges. The privileged: true flag gives the container almost complete control over the host kernel, which is a major security risk. On newer kernels (5.8+), you can scope this down considerably by granting CAP_BPF and CAP_PERFMON instead, though some helpers and older kernels still require CAP_SYS_ADMIN. The principle of least privilege is paramount: any vulnerability in your user-space Go controller could be catastrophic if it's running as a privileged container.
Conclusion
eBPF represents a paradigm shift in kernel observability and runtime security. By moving detection logic from user-space agents directly into the kernel, we can build tools that are orders of magnitude more performant and comprehensive than their traditional counterparts. We've demonstrated a practical, albeit complex, path to building a Kubernetes-aware security monitor: hooking syscalls, enriching events with pod metadata, and considering the critical performance and deployment patterns required for a production environment. While the implementation details, particularly around cgroup and PID management, are non-trivial, the payoff is a level of visibility into your running workloads that was previously unattainable without significant performance compromises.