Real-time Kubernetes Pod Intrusion Detection with eBPF Syscall Hooks
The Kernel Blind Spot in Container Security
In a mature Kubernetes environment, traditional host-based intrusion detection systems (HIDS) often fall short. They either lack the context of Kubernetes abstractions (pods, deployments, namespaces) or they operate with a significant visibility gap. A common approach, the sidecar security agent, introduces non-trivial latency, resource overhead, and configuration complexity for every pod. The fundamental challenge is that user-space security tools are inherently one step removed from the ground truth: the kernel.
When a malicious actor gains execution within a container, their actions manifest as a sequence of system calls (syscalls) handled by the host kernel. Monitoring execve, connect, openat, etc., provides the most accurate, real-time signal of a container's behavior. The problem is accessing this stream of kernel events efficiently and safely from a multi-tenant, containerized context. This is where eBPF (extended Berkeley Packet Filter) transitions from a networking utility to a revolutionary tool for observability and security.
This article bypasses introductory concepts. We assume you understand what eBPF is and why it's powerful. Instead, we will construct a production-viable, real-time process execution monitor for Kubernetes pods. We will write an eBPF program to hook the execve syscall, a user-space Go controller to process the events, and critically, a mechanism to enrich these low-level kernel events with high-level Kubernetes metadata. We will address performance, portability, and deployment patterns essential for any real-world implementation.
Core Architecture: eBPF Program and User-Space Controller
Our system will consist of two primary components:
* The eBPF program (C): attaches to a tracepoint for the execve syscall and, upon execution, collects process details (PID, command) and sends them to user-space via an eBPF ringbuf map.
* The user-space controller (Go): reads events from the ringbuf, enriches them with Kubernetes pod metadata, and evaluates them against a security policy.

We choose a tracepoint (sys_enter_execve) over a kprobe because tracepoints provide a stable, well-defined kernel API, making our program more resilient to kernel version changes. For communication, we'll use ringbuf over the older perf buffer, as it offers better performance, avoids data overwrites under load, and has a more ergonomic API in modern eBPF libraries.
The eBPF Program: Capturing `execve`
Let's start with the C code for our eBPF program. We'll leverage modern eBPF development practices, including vmlinux.h for kernel type definitions (enabling CO-RE) and libbpf helpers.
bpf/trace_exec.c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 256
// Event structure sent to user-space
struct exec_event {
u32 pid;
u32 ppid;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// Ring buffer for sending events to user-space
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Force emitting struct exec_event into the ELF.
const struct exec_event *unused __attribute__((unused));
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter *ctx) {
struct exec_event *e;
struct task_struct *task;
const char *filename_ptr;
u64 id;
u32 pid;
u32 ppid;
// Get PID and process information
id = bpf_get_current_pid_tgid();
pid = id >> 32;
task = (struct task_struct *)bpf_get_current_task();
if (!task) {
return 0;
}
ppid = BPF_CORE_READ(task, real_parent, tgid);
// Reserve space in the ring buffer
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
// Populate the event structure
e->pid = pid;
e->ppid = ppid;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
// Read the filename argument from user-space memory
filename_ptr = (const char *)BPF_CORE_READ(ctx, args[0]);
bpf_probe_read_user_str(&e->filename, sizeof(e->filename), filename_ptr);
// Submit the event to the ring buffer
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
Key Implementation Details:
* CO-RE (Compile Once – Run Everywhere): By including vmlinux.h (generated by bpftool) and using BPF_CORE_READ macros, we make our program portable across different kernel versions without needing to recompile against specific kernel headers for each target node.
* bpf_get_current_task(): This helper gives us a pointer to the task_struct for the current process, which is our entry point for gathering context like the parent PID (ppid).
* bpf_probe_read_user_str(): The filename for execve exists in user-space memory. This helper safely copies the string from the process's address space into our eBPF event structure, handling page faults and memory protection.
* bpf_ringbuf_reserve/submit: This is the standard, high-performance pattern for sending data. We reserve a spot, populate the data directly into the shared memory buffer, and then submit it. This is more efficient than writing to a temporary stack variable and then copying it.
To build this, you'll need clang, libbpf, and bpftool. A typical build command looks like this:
# Generate vmlinux.h for CO-RE
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Compile the C code into an eBPF object file
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I. -c bpf/trace_exec.c -o trace_exec.o
The User-space Go Controller
Now we need a user-space process to load and manage this eBPF program. We'll use the cilium/ebpf library, which provides excellent Go bindings for interacting with the BPF subsystem.
main.go (Initial Version)
package main
import (
"bytes"
"encoding/binary"
"errors"
"log"
"os"
"os/signal"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/ringbuf"
"github.com/cilium/ebpf/rlimit"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf ./bpf/trace_exec.c -- -I./bpf
const (
taskCommLen = 16
maxFilenameLen = 256
)
type execEvent struct {
Pid uint32
Ppid uint32
Comm [taskCommLen]byte
Filename [maxFilenameLen]byte
}
func main() {
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
// Allow the current process to lock memory for eBPF maps.
if err := rlimit.RemoveMemlock(); err != nil {
log.Fatal("Removing memlock limit failed:", err)
}
// Load pre-compiled programs and maps into the kernel.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %s", err)
}
defer objs.Close()
// Attach the tracepoint program.
tp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.TracepointSyscallsSysEnterExecve, nil)
if err != nil {
log.Fatalf("attaching tracepoint: %s", err)
}
defer tp.Close()
log.Println("eBPF program attached. Waiting for events... Press Ctrl-C to exit.")
// Open a ring buffer reader from user-space.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %s", err)
}
defer rd.Close()
// Close the reader when the process receives a signal, which will exit the loop.
go func() {
<-stopper
log.Println("Received signal, exiting...")
if err := rd.Close(); err != nil {
log.Fatalf("closing ringbuf reader: %s", err)
}
}()
var event execEvent
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, ringbuf.ErrClosed) {
log.Println("Ring buffer closed.")
return
}
log.Printf("error reading from ringbuf: %s", err)
continue
}
// Parse the data from the ring buffer into our Go struct.
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing ringbuf event: %s", err)
continue
}
// At this point, we have the raw event. We will add enrichment later.
comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
filename := string(event.Filename[:bytes.IndexByte(event.Filename[:], 0)])
log.Printf("PID: %d, PPID: %d, Comm: %s, Filename: %s", event.Pid, event.Ppid, comm, filename)
}
}
Key Implementation Details:
* bpf2go: This command is a powerful utility from the cilium/ebpf library. It compiles the C code, embeds the resulting eBPF bytecode into a Go file, and generates Go structs and functions for loading and interacting with the eBPF programs and maps. This simplifies deployment immensely, as you only need to ship a single Go binary.
* rlimit.RemoveMemlock(): On kernels before 5.11, eBPF maps are charged against locked memory, so the RLIMIT_MEMLOCK limit must be lifted for the BPF syscalls to succeed; newer kernels account BPF memory via memcg instead. This helper removes the limit for our process either way.
* link.Tracepoint: This is the modern, libbpf-style way to attach eBPF programs. It creates a link that is automatically cleaned up when tp.Close() is called, making the lifecycle management robust.
* ringbuf.NewReader: We create a reader to consume events from the rb map defined in our eBPF program.
* Event Loop: The for loop continuously reads records. The binary.Read call is crucial for parsing the raw byte slice from the kernel into our strongly-typed Go execEvent struct.
Running this on your host machine (as root, or with the CAP_BPF and CAP_PERFMON capabilities) will immediately start printing every execve call across the entire system. But in a Kubernetes cluster, a raw PID is almost useless. We need context.
Advanced Context: Correlating PIDs with Kubernetes Metadata
This is the most critical step for making our tool useful in a Kubernetes environment. The kernel knows about PIDs, but not about Pods, Namespaces, or Deployments. Our user-space controller must act as a bridge, enriching the raw kernel events with this vital context.
The Challenge:
* PID Namespacing: A PID inside a container is different from the PID on the host. Our eBPF program runs in the root PID namespace and sees the host PID.
* Dynamic Nature: Pods are ephemeral. We need a low-latency way to look up pod information for a given PID without constantly querying the Kubernetes API server on every event.
* Race Conditions: A process might start and execute before our user-space controller has learned about the pod it belongs to.
The Solution: An In-Memory Metadata Cache
We will use the client-go library to create an informer that watches for Pod events in the cluster. We will use this to build and maintain an in-memory cache that maps Container IDs to Pod metadata. Then, we need a way to get the Container ID from a host PID.
The most reliable way to get the Container ID for a given process is by parsing its /proc/<PID>/cgroup file. The cgroup path for a containerized process on a typical system looks like this:
/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>
We can parse the CONTAINER_ID from this path and use it as the key for our cache lookup.
Let's evolve our Go controller:
k8s_watcher.go
package main
import (
"context"
"log"
"regexp"
"sync"
"time"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
)
type PodMetadata struct {
Namespace string
PodName string
PodUID string
Container string
}
// K8sWatcher maintains a cache of container ID -> pod metadata.
type K8sWatcher struct {
clientset *kubernetes.Clientset
cache sync.Map // concurrent map: string (containerID) -> PodMetadata
}
func NewK8sWatcher() (*K8sWatcher, error) {
    config, err := clientcmd.BuildConfigFromFlags("", "") // Empty flags fall back to in-cluster config
if err != nil {
return nil, err
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, err
}
return &K8sWatcher{clientset: clientset}, nil
}
func (kw *K8sWatcher) Run(ctx context.Context) {
factory := informers.NewSharedInformerFactory(kw.clientset, 30*time.Second)
podInformer := factory.Core().V1().Pods().Informer()
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: kw.podAdded,
UpdateFunc: kw.podUpdated,
DeleteFunc: kw.podDeleted,
})
factory.Start(ctx.Done())
// Wait for the initial cache sync
if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
log.Println("Timed out waiting for caches to sync")
return
}
log.Println("Kubernetes pod cache synced.")
<-ctx.Done()
}
func (kw *K8sWatcher) getPodMetadata(containerID string) (PodMetadata, bool) {
    val, ok := kw.cache.Load(containerID)
if !ok {
return PodMetadata{}, false
}
return val.(PodMetadata), true
}
func (kw *K8sWatcher) podAdded(obj interface{}) {
pod := obj.(*corev1.Pod)
kw.updateCacheForPod(pod)
}
func (kw *K8sWatcher) podUpdated(oldObj, newObj interface{}) {
pod := newObj.(*corev1.Pod)
kw.updateCacheForPod(pod)
}
func (kw *K8sWatcher) podDeleted(obj interface{}) {
    // The informer can deliver a cache.DeletedFinalStateUnknown wrapper for missed deletes.
    if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
        obj = tombstone.Obj
    }
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        return
    }
for _, status := range pod.Status.ContainerStatuses {
containerID := parseContainerID(status.ContainerID)
if containerID != "" {
kw.cache.Delete(containerID)
}
}
}
func (kw *K8sWatcher) updateCacheForPod(pod *corev1.Pod) {
for _, status := range pod.Status.ContainerStatuses {
containerID := parseContainerID(status.ContainerID)
if containerID != "" {
meta := PodMetadata{
Namespace: pod.Namespace,
PodName: pod.Name,
PodUID: string(pod.UID),
Container: status.Name,
}
kw.cache.Store(containerID, meta)
}
}
}
// containerID format: <runtime>://<id>, e.g. containerd://<id>, cri-o://<id>, docker://<id>
var containerIDRegex = regexp.MustCompile(`^(?:cri-o|containerd|docker)://([a-f0-9]{64})$`)
func parseContainerID(fullID string) string {
matches := containerIDRegex.FindStringSubmatch(fullID)
if len(matches) == 2 {
return matches[1]
}
return ""
}
And now, we integrate this into our main event processing loop.
pid_resolver.go
package main
import (
"bufio"
"fmt"
"os"
"regexp"
"strings"
)
// This regex is simplified; a production one would need to be more robust
// to handle different cgroup drivers and layouts.
var cgroupRegex = regexp.MustCompile(`[0-9]+:[a-z_,=]+:/kubepods/[a-z]+/pod[a-f0-9\-]+/([a-f0-9]{64})`)
func getContainerIDFromPID(pid uint32) (string, error) {
cgroupPath := fmt.Sprintf("/proc/%d/cgroup", pid)
file, err := os.Open(cgroupPath)
if err != nil {
// Process might have exited already
return "", err
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
matches := cgroupRegex.FindStringSubmatch(line)
if len(matches) == 2 {
return matches[1], nil
}
}
return "", fmt.Errorf("container ID not found in %s", cgroupPath)
}
Updated main.go loop:
// ... (in main function, after setting up the watcher)
k8sWatcher, err := NewK8sWatcher()
if err != nil {
log.Fatalf("Failed to create k8s watcher: %v", err)
}
ctx, cancel := context.WithCancel(context.Background())
go k8sWatcher.Run(ctx)
// ... (inside the for loop)
for {
// ... read record ...
// ... parse event ...
go func(event execEvent) {
containerID, err := getContainerIDFromPID(event.Pid)
if err != nil {
// Likely a host process or short-lived container process, can be ignored or logged at a lower level.
return
}
meta, ok := k8sWatcher.getPodMetadata(containerID)
if !ok {
// Metadata not in cache yet. This is a race condition.
// A production system would queue this event for retry.
log.Printf("Metadata for container %s not found", containerID)
return
}
comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
filename := string(event.Filename[:bytes.IndexByte(event.Filename[:], 0)])
// We now have the full context!
log.Printf("ALERT: [Namespace: %s, Pod: %s, Container: %s] Process executed -> PID: %d, Comm: %s, Filename: %s",
meta.Namespace, meta.PodName, meta.Container, event.Pid, comm, filename)
}(event) // Process each event in a goroutine to avoid blocking the ring buffer read
}
// ... (handle shutdown)
cancel()
This enrichment logic is the core of a production system. Handling the race condition where a process starts before the pod is in our cache is a non-trivial edge case. A robust solution involves a temporary queue with a TTL for unenriched events, which are retried a few times before being discarded or logged as-is.
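As an illustration, here is a minimal sketch of such a retry queue, assuming it lives in the same package as our controller and reuses the execEvent and K8sWatcher types defined earlier. The pendingEvent type, buffer size, retry interval, and TTL below are arbitrary choices for this example, not part of the code above.
retry_queue.go (sketch)
package main

import (
    "log"
    "time"
)

// pendingEvent wraps an event whose container metadata was not yet cached.
type pendingEvent struct {
    event       execEvent
    containerID string
    deadline    time.Time
}

// retryQueue buffers unenriched events and retries the cache lookup until a TTL expires.
type retryQueue struct {
    pending chan pendingEvent
    watcher *K8sWatcher
}

func newRetryQueue(w *K8sWatcher) *retryQueue {
    return &retryQueue{pending: make(chan pendingEvent, 1024), watcher: w}
}

// enqueue is called from the event loop when getPodMetadata misses.
func (q *retryQueue) enqueue(e execEvent, containerID string, ttl time.Duration) {
    select {
    case q.pending <- pendingEvent{event: e, containerID: containerID, deadline: time.Now().Add(ttl)}:
    default:
        log.Println("retry queue full, dropping event") // keep memory bounded: drop rather than block
    }
}

// run re-attempts the metadata lookup for queued events on a fixed interval.
func (q *retryQueue) run() {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for range ticker.C {
        for i := len(q.pending); i > 0; i-- {
            p := <-q.pending
            if meta, ok := q.watcher.getPodMetadata(p.containerID); ok {
                log.Printf("enriched after retry: [%s/%s] PID %d", meta.Namespace, meta.PodName, p.event.Pid)
                continue
            }
            if time.Now().After(p.deadline) {
                log.Printf("TTL expired, dropping unenriched event for container %s (PID %d)", p.containerID, p.event.Pid)
                continue
            }
            select {
            case q.pending <- p: // requeue for the next tick
            default:
            }
        }
    }
}
The event-processing goroutine would then call q.enqueue(event, containerID, 10*time.Second) instead of returning immediately when the cache lookup fails.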
Production Pattern: Policy-Based Anomaly Detection
Logging every execve is noisy. The real value is in detecting anomalous executions. We can implement a simple policy engine that defines the expected behavior for our workloads.
Let's define a policy using a Kubernetes ConfigMap:
policy-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ebpf-detector-policy
namespace: security
data:
policy.yaml: |
default: "deny"
rules:
- name: "Allow known binaries in nginx pods"
namespace: "production"
podSelector:
app: "nginx"
allow:
- "/usr/sbin/nginx"
- "/bin/sh" # For init scripts, be specific
- name: "Deny shells in all pods"
namespace: "*"
deny:
- "/bin/bash"
- "/bin/sh"
- "/bin/zsh"
- "/usr/bin/script"
Our Go controller would watch this ConfigMap, parse the YAML, and use it to evaluate each enriched event. An execve call from an nginx pod to /usr/sbin/nginx would be allowed, but an execve to /bin/bash would be flagged as a high-severity alert.
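To work with this ConfigMap in Go, the controller needs types that mirror the YAML. The sketch below is one possible shape, assuming a YAML unmarshaller such as sigs.k8s.io/yaml; the type and field names are illustrative, not taken from an existing library.
policy.go (sketch)
package main

// Policy mirrors the structure of policy.yaml in the ConfigMap above.
type Policy struct {
    Default string       `yaml:"default"` // "allow" or "deny"
    Rules   []PolicyRule `yaml:"rules"`
}

// PolicyRule describes allowed or denied executables for a set of pods.
type PolicyRule struct {
    Name        string            `yaml:"name"`
    Namespace   string            `yaml:"namespace"`   // "*" matches any namespace
    PodSelector map[string]string `yaml:"podSelector"` // matched against pod labels
    Allow       []string          `yaml:"allow"`       // permitted executable paths
    Deny        []string          `yaml:"deny"`        // prohibited executable paths
}
Note that matching on podSelector would also require carrying the pod's labels in PodMetadata, which our cache does not yet store.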
Implementing the full policy engine is extensive, but the logic within the event processing goroutine would look like this:
// Inside the event processing goroutine...
enrichedEvent := EnrichedExecEvent{
Event: event,
Metadata: meta,
}
verdict := policyEngine.Evaluate(enrichedEvent)
if verdict.IsDeny() {
// Generate a high-fidelity security alert
// e.g., send to SIEM, log as structured JSON with ERROR level
log.Printf("POLICY VIOLATION: %s - Event: %+v", verdict.Reason, enrichedEvent)
} else {
// Log at DEBUG level or discard
}
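For completeness, here is a minimal sketch of the types and Evaluate method assumed by the snippet above. EnrichedExecEvent, Verdict, and PolicyEngine are illustrative names; this version evaluates rules in order (first match wins), matches only on namespace and executable path, and falls back to the default verdict.
policy_engine.go (sketch)
package main

import "bytes"

// EnrichedExecEvent pairs a raw kernel event with its Kubernetes metadata.
type EnrichedExecEvent struct {
    Event    execEvent
    Metadata PodMetadata
}

// Verdict is the result of evaluating one event against the policy.
type Verdict struct {
    Deny   bool
    Reason string
}

func (v Verdict) IsDeny() bool { return v.Deny }

// PolicyEngine holds the policy parsed from the ConfigMap.
type PolicyEngine struct {
    policy Policy
}

// Evaluate walks the rules in order; the first matching deny or allow entry wins,
// otherwise the default policy applies.
func (pe *PolicyEngine) Evaluate(e EnrichedExecEvent) Verdict {
    filename := cString(e.Event.Filename[:])
    for _, rule := range pe.policy.Rules {
        if rule.Namespace != "*" && rule.Namespace != e.Metadata.Namespace {
            continue
        }
        for _, denied := range rule.Deny {
            if filename == denied {
                return Verdict{Deny: true, Reason: "matched deny rule: " + rule.Name}
            }
        }
        for _, allowed := range rule.Allow {
            if filename == allowed {
                return Verdict{Deny: false, Reason: "matched allow rule: " + rule.Name}
            }
        }
    }
    return Verdict{Deny: pe.policy.Default == "deny", Reason: "default policy"}
}

// cString trims a NUL-terminated byte slice into a Go string.
func cString(b []byte) string {
    if i := bytes.IndexByte(b, 0); i >= 0 {
        return string(b[:i])
    }
    return string(b)
}
With first-match semantics, the allow entry for /bin/sh in the nginx rule takes precedence over the global deny rule for nginx pods; whether that is the desired behavior is a policy design decision.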
Performance and Deployment Considerations
* Performance Overhead: The in-kernel overhead of our tracepoint handler is measured in nanoseconds per execve event. The primary cost is the data transfer to user-space and the processing there. Using ringbuf, processing events in goroutines, and an efficient in-memory cache are all crucial for keeping the user-space agent's CPU and memory footprint low.
* Kernel Compatibility & CO-RE: Our use of vmlinux.h and libbpf helpers is the modern standard. This CO-RE approach avoids the brittle need to recompile for every kernel version, which was a major operational headache for early eBPF adopters. Your build pipeline must include the bpftool step to generate vmlinux.h from a host with BTF info enabled (standard in modern Linux distributions).
* Architecture (arm64 vs x86_64): Syscall calling conventions differ between architectures. Our tracepoint-based execve example reads its arguments from the stable args array in the tracepoint context, but kprobe-based programs that pull arguments out of registers need per-architecture handling (for example, the __TARGET_ARCH_arm64 / __TARGET_ARCH_x86 defines that select libbpf's PT_REGS macros) or more abstract BPF helpers to remain portable.
* Deployment as a DaemonSet: This tool must run on every node to provide complete coverage. A DaemonSet is the correct Kubernetes pattern. The pod spec requires significant privileges, which must be carefully managed.
daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: ebpf-exec-detector
namespace: security
spec:
selector:
matchLabels:
name: ebpf-exec-detector
template:
metadata:
labels:
name: ebpf-exec-detector
spec:
hostPID: true # Needed for /proc access
tolerations:
- operator: Exists
containers:
- name: detector
image: your-repo/ebpf-detector:latest
securityContext:
privileged: true # Easiest, but dangerous. Better to use capabilities.
# capabilities:
# add:
# - "SYS_ADMIN"
# - "BPF"
# - "PERFMON"
volumeMounts:
- name: proc
mountPath: /proc
readOnly: true
volumes:
- name: proc
hostPath:
path: /proc
Security Context: Running as privileged: true is the sledgehammer approach. A more secure, production-hardened setup would grant only the necessary capabilities: CAP_BPF and CAP_PERFMON (on recent kernels) or CAP_SYS_ADMIN (on older ones), and mount specific host paths like /sys/kernel/debug if needed.
Conclusion
We have designed and implemented the core of a sophisticated, kernel-level security monitor for Kubernetes. By leveraging eBPF, we gain unparalleled, low-overhead visibility into container behavior at the syscall level. The true complexity and value, however, lie in the user-space component. Enriching kernel events with Kubernetes context, managing an efficient metadata cache, and applying a declarative policy engine are what elevate this from a simple eBPF program to a production-grade intrusion detection tool. This architecture provides a foundation for extending monitoring to other critical syscalls (connect, openat, bpf), building a comprehensive picture of workload behavior and detecting threats in real-time, right where they happen.