eBPF for Container Escape Detection via Syscall Auditing
The Kernel as the Final Frontier in Container Security
In modern cloud-native environments, we invest heavily in securing container images, managing network policies, and enforcing RBAC. Yet, a determined attacker who achieves remote code execution (RCE) within a container has a single objective: escape to the underlying host. Traditional host-based intrusion detection systems (HIDS) often lack the context to differentiate between legitimate and malicious container activity, while auditd can impose prohibitive performance overhead. This is where eBPF (extended Berkeley Packet Filter) provides a paradigm shift.
eBPF allows us to run sandboxed programs within the Linux kernel itself, triggered by events like syscalls, network packets, or kernel function calls. This gives us a programmable, performant, and secure observation point at the most fundamental boundary between a container and the host. This article eschews introductory concepts and dives directly into building a production-ready container escape detection agent. We will focus on intercepting namespace-switching syscalls—the primary vector for an escape—using eBPF kprobes, filtering events at the source using cgroup IDs, and processing them in a high-performance Go userspace application.
Our goal is to build a tool that detects an attacker using a compromised container process to call setns() and join the host's namespaces, a hallmark of a container escape. We will address the critical engineering challenges: portability across kernel versions using CO-RE, efficient data transfer from kernel to userspace with ring buffers, and the nuanced logic required to distinguish a real attack from legitimate administrative actions.
Section 1: Deconstructing the Escape: Syscalls as Choke Points
A container is, fundamentally, a collection of Linux namespaces (pid, mnt, net, ipc, uts, user) and cgroups. An escape is the act of a process breaking out of these constraints. While vulnerabilities like CVE-2019-5736 (runc) exist, a more common post-exploitation technique involves an attacker leveraging a privileged container (--privileged) or a misconfiguration to directly manipulate namespaces.
The most direct syscall for this purpose is setns(int fd, int nstype). An attacker with sufficient capabilities (like CAP_SYS_ADMIN) can open a file descriptor to a host namespace file (e.g., /proc/1/ns/pid) and use setns() to move their process into that namespace, effectively becoming a host process.
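To make this concrete, here is an illustrative Go fragment (using golang.org/x/sys/unix) showing the shape of such an escape attempt from inside a privileged container. It is a sketch, not a working exploit:
// Illustrative only: the shape of a setns()-based escape from a privileged
// container. Requires CAP_SYS_ADMIN and visibility of the host's /proc/1.
fd, err := unix.Open("/proc/1/ns/mnt", unix.O_RDONLY, 0)
if err != nil {
	log.Fatal(err)
}
defer unix.Close(fd)
// Join the host's mount namespace: exactly the do_setns call our probe sees.
// Caveat: setns(CLONE_NEWNS) fails in multithreaded processes, which is why
// real payloads often exec nsenter(1) or a single-threaded helper instead.
if err := unix.Setns(fd, unix.CLONE_NEWNS); err != nil {
	log.Fatal(err)
}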
Therefore, our primary detection points are the kernel functions that implement these syscalls. We won't hook the syscall entry tracepoint (e.g., sys_enter_setns), as it exposes only the raw, user-supplied arguments. Instead, we'll use kprobes to attach to the internal kernel functions, which gives us access to fully resolved kernel data structures.
Our target kernel functions will be:
* do_setns(): The core kernel function that handles the logic for setns(). Intercepting it lets us see the file descriptor and the requested namespace type.
* do_mount(): Another vector, where an attacker might try to mount a sensitive host directory (like /) into the container to gain access.
By placing our eBPF probes here, we create an audit trail that is very hard to bypass: no matter how obfuscated the userspace binary is, it must ultimately invoke these kernel functions to achieve its goal.
Section 2: The eBPF Kernel Program: CO-RE and In-Kernel Filtering
Our eBPF program, written in C, is the heart of our detection agent. We will design it with two principles in mind: performance and portability. Performance is achieved by filtering out irrelevant events inside the kernel to avoid the cost of sending data to userspace. Portability is achieved using CO-RE (Compile Once – Run Everywhere), which leverages BTF (BPF Type Format) to avoid the need to recompile the BPF program for every target kernel version.
Here is the complete C code for our BPF program (detector.bpf.c). We will break it down piece by piece.
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#define TASK_COMM_LEN 16
#define MAX_ENTRIES 10240
// Event structure sent to userspace
struct event {
__u64 cgroup_id;
__u32 host_pid;
__u32 host_tid;
__u32 host_ppid;
char comm[TASK_COMM_LEN];
// For setns
int nstype;
// For mount
char src[256];
char dst[256];
};
// BPF ring buffer for sending events to userspace
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Optional: For filtering by host cgroup ID
const volatile u64 host_cgroup_id = 0;
// Force emitting struct event into the ELF.
const struct event *unused __attribute__((unused));
// Helper to get parent PID
static __always_inline __u32 get_ppid(void) {
struct task_struct *task = (struct task_struct *)bpf_get_current_task();
struct task_struct *parent;
// Use BPF_CORE_READ for portability
BPF_CORE_READ_INTO(&parent, task, real_parent);
if (parent) {
return BPF_CORE_READ(parent, tgid);
}
return 0;
}
// Kprobe attached to do_setns
SEC("kprobe/do_setns")
int BPF_KPROBE(do_setns, int fd, int nstype) {
u64 cgroup_id = bpf_get_current_cgroup_id();
// CRITICAL: Filter out events from the host itself.
if (host_cgroup_id != 0 && cgroup_id == host_cgroup_id) {
return 0;
}
u64 id = bpf_get_current_pid_tgid();
u32 tgid = id >> 32;
u32 pid = id;
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
e->cgroup_id = cgroup_id;
e->host_pid = tgid;
e->host_tid = pid;
e->host_ppid = get_ppid();
	e->nstype = nstype;
	// Zero the mount-only fields: bpf_ringbuf_reserve does not clear its
	// allocation, so stale bytes would otherwise leak to userspace.
	e->src[0] = '\0';
	e->dst[0] = '\0';
bpf_get_current_comm(&e->comm, sizeof(e->comm));
bpf_ringbuf_submit(e, 0);
return 0;
}
// Kprobe attached to do_mount
SEC("kprobe/do_mount")
int BPF_KPROBE(do_mount, const char *dev_name, const char *dir_name, const char *type, unsigned long flags, void *data) {
u64 cgroup_id = bpf_get_current_cgroup_id();
if (host_cgroup_id != 0 && cgroup_id == host_cgroup_id) {
return 0;
}
	// We are interested in mounts that could expose the host filesystem.
	// This is a naive check; a real implementation would be more robust.
	// By the time do_mount runs, dev_name has already been copied into
	// kernel memory (copy_mount_string), so read it as a kernel string.
	char dev[8] = {};
	bpf_probe_read_kernel_str(&dev, sizeof(dev), dev_name);
	// Forward only mounts whose source is an absolute path or "proc".
	int is_proc = dev[0] == 'p' && dev[1] == 'r' && dev[2] == 'o' && dev[3] == 'c' && dev[4] == '\0';
	if (dev[0] != '/' && !is_proc) {
		return 0;
	}
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
u64 id = bpf_get_current_pid_tgid();
e->cgroup_id = cgroup_id;
e->host_pid = id >> 32;
e->host_tid = id;
	e->host_ppid = get_ppid();
	// Sentinel: lets userspace distinguish mount events from setns events
	// (setns can legitimately be called with nstype == 0).
	e->nstype = -1;
	bpf_get_current_comm(&e->comm, sizeof(e->comm));
	// dev_name is a kernel string here; dir_name is still a user pointer.
	bpf_probe_read_kernel_str(&e->src, sizeof(e->src), dev_name);
	bpf_probe_read_user_str(&e->dst, sizeof(e->dst), dir_name);
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
Key Components of the BPF Program:
* #include "vmlinux.h": This is the magic of CO-RE. The header is generated from the kernel's BTF type information and contains all kernel type definitions for a specific architecture. The libbpf loader uses it to perform relocations at load time, adjusting our BPF code to match the memory layout of the running kernel.
* struct event: This is our data contract with the userspace application. It's crucial to keep this struct lean to minimize data transfer overhead. We include the cgroup_id as our primary container identifier, PID/TID/PPID for process context, and syscall-specific arguments.
* BPF_MAP_TYPE_RINGBUF: We use a ring buffer map (rb) to send data to userspace. Compared to the older BPF_MAP_TYPE_PERF_EVENT_ARRAY, ring buffers are more performant for high-volume event streams: they avoid per-CPU allocation, use memory more efficiently, and preserve event ordering across CPUs. We allocate a 256 KB buffer, which should be tuned based on expected event volume.
* host_cgroup_id: This const volatile variable is a placeholder. Our Go application resolves the cgroup ID of PID 1's cgroup and patches this constant into the BPF program before loading it. The check if (cgroup_id == host_cgroup_id) { return 0; } then acts as a highly efficient in-kernel filter, discarding syscalls from that cgroup before any data crosses into userspace. Note that on systemd hosts PID 1 lives in init.scope rather than the root cgroup, so a production agent would track the full set of container cgroup IDs rather than a single host ID.
* BPF_CORE_READ macro: In the get_ppid helper, we need to traverse task_struct->real_parent->tgid. The memory offset of real_parent within task_struct can change between kernel versions. BPF_CORE_READ records a relocation that libbpf resolves at load time, ensuring our code correctly accesses the field on any kernel that exposes BTF.
* kprobe/do_setns: This is our main probe. It fires every time the do_setns kernel function is called, retrieves the current cgroup ID, filters against the host cgroup, and, if the call came from a container, reserves space in the ring buffer, populates the event struct, and submits it.
To compile this by hand, you'll need clang and bpftool:
# Generate vmlinux.h
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Compile the BPF C code into an object file
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I. -c detector.bpf.c -o detector.bpf.o
Section 3: The Go Userspace Controller: Processing the Signal
The userspace application is responsible for loading the BPF object, managing its lifecycle, and intelligently processing the stream of events from the kernel. We'll use the cilium/ebpf library, which provides excellent Go bindings for interacting with the eBPF subsystem.
Here is the complete Go application (main.go).
package main
import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"log"
	"os"
	"os/signal"
	"strings"
	"syscall"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
	"github.com/cilium/ebpf/rlimit"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf detector.bpf.c -- -I.
// This matches the C struct in detector.bpf.c
type event struct {
CgroupID uint64
HostPid uint32
HostTid uint32
HostPpid uint32
Comm [16]byte
NsType int32
Src [256]byte
Dst [256]byte
}
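binary.Read performs no layout validation, so any drift between the C and Go struct definitions silently corrupts events. A cheap guard is a startup assertion against the expected size (8 + 3×4 + 16 + 4 + 2×256 = 552 bytes; the C struct needs no tail padding since 552 is a multiple of 8):
// Guard against struct drift: binary.Size computes the packed encoding size,
// which must match sizeof(struct event) on the C side (552 bytes here).
func init() {
	if sz := binary.Size(event{}); sz != 552 {
		panic(fmt.Sprintf("event struct size drifted: got %d, want 552", sz))
	}
}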
func main() {
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
	// Allow the current process to lock memory for eBPF maps.
	// (Not required on kernels >= 5.11, where BPF memory is cgroup-accounted.)
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatalf("failed to remove memlock rlimit: %v", err)
	}
	// Load the pre-compiled collection spec generated by bpf2go.
	spec, err := loadBpf()
	if err != nil {
		log.Fatalf("loading collection spec: %v", err)
	}
	// --- Advanced: In-kernel filtering setup ---
	// host_cgroup_id is declared `const volatile` in the BPF C code, so it
	// lives in read-only data and must be patched before the program loads.
	if hostCgroupID, err := getHostCgroupID(); err != nil {
		log.Printf("Warning: could not get host cgroup id, continuing without in-kernel filtering: %v", err)
	} else {
		log.Printf("Found host cgroup ID: %d. Applying in-kernel filter.", hostCgroupID)
		if err := spec.RewriteConstants(map[string]interface{}{"host_cgroup_id": hostCgroupID}); err != nil {
			log.Fatalf("rewriting host_cgroup_id constant: %v", err)
		}
	}
	// Load programs and maps into the kernel.
	objs := bpfObjects{}
	if err := spec.LoadAndAssign(&objs, nil); err != nil {
		log.Fatalf("loading objects: %v", err)
	}
	defer objs.Close()
// Attach kprobe to do_setns
kpSetns, err := link.Kprobe("do_setns", objs.DoSetns, nil)
if err != nil {
log.Fatalf("attaching kprobe to do_setns: %v", err)
}
defer kpSetns.Close()
log.Println("Attached kprobe to do_setns")
// Attach kprobe to do_mount
kpMount, err := link.Kprobe("do_mount", objs.DoMount, nil)
if err != nil {
log.Fatalf("attaching kprobe to do_mount: %v", err)
}
defer kpMount.Close()
log.Println("Attached kprobe to do_mount")
	// Open a reader for the RINGBUF map defined in the eBPF C program.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %v", err)
}
defer rd.Close()
// Close the reader when the process exits, freeing up resources.
go func() {
<-stopper
log.Println("Received signal, exiting...")
if err := rd.Close(); err != nil {
log.Fatalf("closing ringbuf reader: %v", err)
}
}()
log.Println("Waiting for events...")
var ev event
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, ringbuf.ErrClosed) {
log.Println("Received signal, exiting...")
return
}
log.Printf("reading from reader: %v", err)
continue
}
// Parse the data into the Go struct
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &ev); err != nil {
log.Printf("parsing ringbuf event: %v", err)
continue
}
// Production logic starts here: enrich, correlate, and alert.
comm := string(ev.Comm[:bytes.IndexByte(ev.Comm[:], 0)])
		if ev.NsType != -1 { // -1 is the mount sentinel set by the BPF program
log.Printf("ALERT: Potential Container Escape!\n\tSyscall: setns\n\tComm: %s\n\tPID: %d\n\tPPID: %d\n\tCgroup ID: %d\n\tNamespace Type: %d",
comm, ev.HostPid, ev.HostPpid, ev.CgroupID, ev.NsType)
} else {
src := string(ev.Src[:bytes.IndexByte(ev.Src[:], 0)])
dst := string(ev.Dst[:bytes.IndexByte(ev.Dst[:], 0)])
log.Printf("ALERT: Potential Malicious Mount!\n\tSyscall: mount\n\tComm: %s\n\tPID: %d\n\tCgroup ID: %d\n\tSource: '%s'\n\tDestination: '%s'",
comm, ev.HostPid, ev.CgroupID, src, dst)
}
}
}
// getHostCgroupID resolves the cgroup ID of the init process (PID 1). On
// cgroup v2, bpf_get_current_cgroup_id() returns the inode number of the
// process's cgroup directory in the unified hierarchy, so we stat that
// directory. A production agent should also handle cgroup v1 and
// non-standard mount points.
func getHostCgroupID() (uint64, error) {
	data, err := os.ReadFile("/proc/1/cgroup")
	if err != nil {
		return 0, fmt.Errorf("reading /proc/1/cgroup: %w", err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		// For cgroup v2, the line looks like: "0::/init.scope"
		if !strings.HasPrefix(line, "0::") {
			continue
		}
		cgPath := "/sys/fs/cgroup" + strings.TrimPrefix(line, "0::")
		st, err := os.Stat(cgPath)
		if err != nil {
			return 0, fmt.Errorf("stat %s: %w", cgPath, err)
		}
		stat, ok := st.Sys().(*syscall.Stat_t)
		if !ok {
			return 0, errors.New("could not cast to syscall.Stat_t")
		}
		return stat.Ino, nil
	}
	return 0, errors.New("could not find cgroup v2 entry for PID 1")
}
Analysis of the Go Controller:
* go:generate: This magic comment automates converting our BPF C code into a Go package. bpf2go compiles the C code and embeds the resulting object file into the Go binary, along with Go bindings that mirror our BPF maps and programs. This creates a self-contained, distributable agent.
* rlimit.RemoveMemlock: A classic requirement for eBPF. BPF maps and programs are locked into kernel memory to prevent them from being paged out to disk; this call lifts RLIMIT_MEMLOCK so our process is allowed to create them (kernels 5.11+ account BPF memory via cgroups instead).
* Constant rewriting: getHostCgroupID() resolves the cgroup ID of PID 1 by stating its cgroup directory in the unified hierarchy. Because host_cgroup_id is declared const volatile, it lives in the program's read-only data, so we patch it with spec.RewriteConstants() before LoadAndAssign() loads the program. Once loaded, our kernel filter is live, and we've drastically reduced the amount of data we need to process.
* Probe attachment: link.Kprobe attaches our compiled BPF functions (objs.DoSetns, objs.DoMount) to the target kernel functions. The link package handles the cleanup, ensuring the probes are detached when kp.Close() is called.
* ringbuf.NewReader: We instantiate a reader for our ring buffer map. This provides a clean, blocking API for consuming records from the kernel.
* The event loop: rd.Read() blocks until a new record is available from the kernel. We then perform a binary read to parse the raw byte slice into our Go event struct. The logic inside this loop is where a simple tool becomes a production-grade agent. Our example just logs an alert, but in a real system, this is where you would:
* Enrich the event: Use the CgroupID to query the container runtime (e.g., Docker daemon, containerd) to find the container name, image, and labels (see the resolution sketch after this list).
* Correlate activity: Is this setns call part of a known administrative script? Is the parent process (HostPpid) a known shell or a suspicious binary?
* Maintain state: Track process ancestry to detect suspicious chains of execution.
* Send to a SIEM: Format the enriched alert and forward it to a security information and event management system.
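As an example of the enrichment step, the CgroupID in each event can be mapped back to a cgroup path, and from there to a container ID embedded in the directory name (the exact layout is runtime-specific), by walking the cgroup v2 hierarchy until an inode matches. A sketch using path/filepath, io/fs, and syscall (filepath.SkipAll requires Go 1.20+):
// cgroupPathByID walks /sys/fs/cgroup for the directory whose inode matches
// the cgroup ID reported by the BPF program (cgroup v2 only). Hypothetical
// helper; on busy hosts, cache the results instead of walking per event.
func cgroupPathByID(id uint64) (string, error) {
	var found string
	err := filepath.WalkDir("/sys/fs/cgroup", func(p string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return nil // skip unreadable entries
		}
		info, err := d.Info()
		if err != nil {
			return nil
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok && st.Ino == id {
			found = p
			return filepath.SkipAll // match found, stop walking
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	if found == "" {
		return "", fmt.Errorf("no cgroup directory with id %d", id)
	}
	return found, nil
}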
Section 4: Production Hardening and Edge Cases
What we've built is a powerful detector, but production environments introduce complexity. Senior engineers must anticipate and handle these edge cases.
**Edge Case 1: The False Positive Problem**
Not every setns call is malicious. Tools like docker exec, kubectl exec, and even some liveness probes use setns to enter a container's namespaces. A naive alerter would be unacceptably noisy.
Solution: Advanced userspace logic. When an alert fires, the controller must gather more context:
* Read /proc/[HostPid]/stat to walk the parent process chain. Is the parent dockerd, containerd, or kubelet? This is a strong signal of legitimate activity.
* Resolve /proc/[HostPid]/exe. Is the binary a known administrative tool? You can maintain an allow-list of binary paths and their checksums.
* Check the calling context. Is the process running as root? Is it from an interactive shell (/proc/[HostPid]/fd/0 is a TTY)?
// Example of enrichment logic inside the loop
comm := string(ev.Comm[:bytes.IndexByte(ev.Comm[:], 0)])
exePath, err := os.Readlink(fmt.Sprintf("/proc/%d/exe", ev.HostPid))
if err == nil {
// Check exePath against an allow-list of binaries like /usr/bin/docker, /usr/bin/runc, etc.
if isAllowedBinary(exePath) {
log.Printf("INFO: Allowed setns call by %s (%s)", comm, exePath)
continue // Suppress alert
}
}
// ... proceed with alert
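The isAllowedBinary helper referenced above is left undefined; a minimal sketch might look like the following, with illustrative paths that should really come from configuration:
// Minimal sketch of the allow-list check used above. Paths are illustrative;
// a production agent should also verify a checksum of the binary, since a
// path alone can be spoofed via bind mounts.
var allowedBinaries = map[string]bool{
	"/usr/bin/runc":    true,
	"/usr/bin/nsenter": true, // only if your ops tooling legitimately uses it
}

func isAllowedBinary(path string) bool {
	return allowedBinaries[path]
}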
**Edge Case 2: High Event Volume and Dropped Events**
On a host with thousands of containers and high syscall rates, the ring buffer can fill up if the userspace consumer can't keep up. The kernel will start dropping events, creating blind spots.
Solution:
* Monitor for drops. The cilium/ebpf library's ringbuf.Reader doesn't directly expose drop counts (for perf buffers, the library does report lost samples), and it's a known limitation to be aware of. However, the BPF program itself knows when bpf_ringbuf_reserve() fails, so it can maintain its own counter, as sketched below; you can also inspect map state with external tools like bpftool.
* Increase the buffer. max_entries in the BPF map definition can be increased. This costs more locked kernel memory but provides more headroom for event spikes.
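The C program above doesn't count its own drops, but the pattern is straightforward: add a per-CPU counter map (say drop_count, a hypothetical BPF_MAP_TYPE_PERCPU_ARRAY with one slot) and increment slot 0 whenever bpf_ringbuf_reserve() returns NULL. The Go side can then poll it. A sketch, assuming bpf2go exposes the map as objs.DropCount and that time and github.com/cilium/ebpf are imported:
// Sketch: poll a hypothetical per-CPU counter map ("drop_count") that the
// BPF program increments whenever bpf_ringbuf_reserve() returns NULL.
func watchDrops(dropCount *ebpf.Map) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	var last uint64
	for range ticker.C {
		var perCPU []uint64 // per-CPU maps unmarshal into a slice
		if err := dropCount.Lookup(uint32(0), &perCPU); err != nil {
			log.Printf("reading drop counter: %v", err)
			continue
		}
		var total uint64
		for _, n := range perCPU {
			total += n
		}
		if total > last {
			log.Printf("WARNING: %d events dropped since last check; consider growing the ring buffer", total-last)
		}
		last = total
	}
}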
**Edge Case 3: PID Namespace Mapping**
The HostPid in our event is the PID from the host's perspective. For an operator debugging an alert, this alone is not useful; they know the container and the PID inside the container.
Solution: The userspace controller must perform PID translation. By reading /proc/[HostPid]/status, you can find the NSpid line, which lists the process's ID in every PID namespace it belongs to, from the host's namespace down to the innermost one.
// /proc/12345/status
...
NStgid: 12345 15
NSpid: 12345 15
NSpgid: 12345 15
...
This shows that host PID 12345 is known as PID 15 inside its container's PID namespace.
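A small hypothetical helper can perform that translation by parsing the NSpid line (standard library only: os, fmt, strings, strconv, errors):
// containerPID translates a host PID to the PID seen inside the process's
// innermost PID namespace by parsing NSpid from /proc/<pid>/status.
func containerPID(hostPID uint32) (int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", hostPID))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "NSpid:") {
			fields := strings.Fields(line)
			// The last field is the PID in the innermost namespace.
			return strconv.Atoi(fields[len(fields)-1])
		}
	}
	return 0, errors.New("no NSpid line (kernel older than 4.1?)")
}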
Section 5: Performance Under a Microscope
The primary justification for using eBPF is its low performance overhead. Let's quantify this.
* CPU Overhead: The cost is paid per syscall. The execution path of our BPF program is extremely short: a few helper calls and a memory write. On a modern CPU, this is measured in tens to hundreds of nanoseconds. For a system doing 100,000 relevant syscalls per second, the total CPU overhead would be negligible, typically well under 1% of a single core. This can be verified with bpftool prog profile.
* Memory Overhead: The cost is fixed. It's the size of the BPF programs plus the size of the maps. In our case, this is the 256KB ring buffer and a few kilobytes for the program instructions. This is a trivial amount of locked kernel memory.
Comparison to auditd: The auditd framework is powerful but notoriously expensive. Its rules are processed for every single syscall, and its logging mechanism can create significant I/O and CPU load, sometimes adding 10-30% overhead to busy systems. Our eBPF program is far more targeted; the kprobes only fire on specific functions, and the in-kernel filter drops most events before they incur any data transfer cost.
Conclusion: A Modern Approach to Runtime Defense
We have successfully engineered a specific, high-performance container escape detector. This approach embodies the principles of modern systems security: moving detection as close to the source of truth as possible—the kernel—and leveraging its programmability to create intelligent, low-overhead tooling.
By combining kprobes for precise interception, CO-RE for kernel version portability, ring buffers for efficient data transfer, and intelligent in-kernel filtering, we built a solution that is both powerful and practical for production deployment. The real work of a senior engineer is not just in writing the BPF or Go code, but in understanding the complex interplay between the kernel and userspace, anticipating the edge cases of a real-world environment, and building the robust enrichment and correlation logic that turns raw data into actionable security intelligence.