Detecting Container Escapes with Advanced eBPF Syscall Monitoring

Goh Ling Yong

The Kernel as the Final Frontier for Container Security

In any mature Kubernetes or containerized environment, we layer security meticulously. We use static image scanners, enforce network policies with service meshes, implement RBAC, and follow the principle of least privilege. Yet, the ultimate boundary between a container and the host—and by extension, the entire cluster—is the Linux kernel. A sufficiently privileged or well-positioned attacker who can break this boundary has achieved a critical objective. This is the essence of a container escape.

Traditional security tools often operate too far from this boundary. A Web Application Firewall (WAF) won't see an attacker leveraging a kernel vulnerability. A static image scanner cannot predict the runtime behavior of a process spawned from a legitimate-looking binary. To effectively detect and respond to escape attempts, we must instrument the kernel itself.

This is where eBPF (extended Berkeley Packet Filter) transitions from a networking and observability tool into a formidable security instrument. It allows us to safely execute sandboxed programs within the kernel, providing visibility into syscalls, function calls, and network activity with negligible performance overhead. This article will not explain what eBPF is; it assumes you are familiar with its architecture. Instead, we will focus on building specific, high-signal detectors for common container escape vectors.

We'll explore two primary attack vectors and their eBPF-based detection patterns:

  • Namespace Manipulation: An attacker attempts to switch from the container's isolated namespaces (PID, mount, network) to the host's namespaces using the setns syscall.
  • Privilege Escalation via Capabilities: A process within a container attempts to abuse overly permissive Linux capabilities to interact with or modify the host kernel, for example, by loading a kernel module (CAP_SYS_MODULE).

Our goal is to create high-fidelity alerts by combining kernel-level data from eBPF with userspace context, all while navigating the performance pitfalls and deployment complexities of a real-world production system.


    Pattern 1: Detecting Malicious Namespace Switching with `kprobes`

    The setns syscall is the primary mechanism for a process to join an existing namespace. While container runtimes use it for legitimate purposes during container creation, a call to setns from an already running containerized process is highly suspicious. It's a classic indicator of an attacker trying to break out of their isolation.
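
    To make the signal concrete, here is a minimal sketch of what such an escape attempt can look like. Go and the golang.org/x/sys/unix wrapper are used for consistency with the agent code later in this article; a real attacker is just as likely to use C or nsenter. The sketch assumes the process holds CAP_SYS_ADMIN and can see the host's /proc:

    go
    // escape_sketch.go — illustrative only: the shape of a setns-based escape.
    package main

    import (
        "log"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Open a handle to the host's mount namespace via PID 1. This only
        // works if the container can see the host's /proc and the process
        // holds CAP_SYS_ADMIN.
        fd, err := os.Open("/proc/1/ns/mnt")
        if err != nil {
            log.Fatalf("opening host mount namespace: %v", err)
        }
        defer fd.Close()

        // The setns(2) call our kprobe intercepts.
        if err := unix.Setns(int(fd.Fd()), unix.CLONE_NEWNS); err != nil {
            log.Fatalf("setns failed: %v", err)
        }
        log.Println("joined the host mount namespace")
    }

    Every run of a program like this produces exactly the kind of setns invocation our probe is built to catch.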

    The eBPF Kernel-Side Probe

    Our strategy is to attach a kernel probe (kprobe) to the do_setns kernel function, the kernel-side implementation of the setns syscall. This lets us intercept every attempt before it completes. (Kernel symbol names vary across versions; if do_setns is not available on your kernel, attaching to the syscall entry point, e.g. __x64_sys_setns, or using the sys_enter_setns tracepoint achieves the same interception.)

    We will use C with libbpf and a focus on CO-RE (Compile Once - Run Everywhere) principles using BTF (BPF Type Format) to ensure our program is portable across different kernel versions—a non-negotiable requirement for production environments.

    Here is the eBPF C code (setns_monitor.bpf.c).

    c
    // setns_monitor.bpf.c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_tracing.h>
    
    // Event structure sent to userspace
    struct setns_event {
        u64 cgroup_id;
        u32 host_pid;
        u32 host_tgid;
        int ns_type;
        char comm[16];
    };
    
    // BPF ring buffer for high-performance event submission
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Kprobe on do_setns; the BPF_KPROBE macro unpacks the arguments from pt_regs.
    SEC("kprobe/do_setns")
    int BPF_KPROBE(do_setns, int fd, int nstype) {
        // Get PID and TGID from the host perspective
        u64 id = bpf_get_current_pid_tgid();
        u32 tgid = id >> 32;
        u32 pid = id;
    
        // Optimization: skip events from the idle task (TGID 0)
        if (tgid == 0) {
            return 0;
        }
    
        // Get the cgroup ID to identify the container
        u64 cgroup_id = bpf_get_current_cgroup_id();
    
        // Reserve space in the ring buffer
        struct setns_event *e;
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (!e) {
            return 0; // Failed to reserve space
        }
    
        // Populate the event structure
        e->cgroup_id = cgroup_id;
        e->host_pid = pid;
        e->host_tgid = tgid;
        e->ns_type = nstype;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
        // Submit the event to userspace
        bpf_ringbuf_submit(e, 0);
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Implementation Details:

    * vmlinux.h: This header is generated by bpftool and contains the kernel type definitions required for CO-RE. It allows us to write code against kernel structures like task_struct in a portable way.

    * BPF_MAP_TYPE_RINGBUF: We use a ring buffer instead of the older perf buffer (BPF_PERF_OUTPUT in bcc terms). Ring buffers are more performant for high-volume event streams: they reduce per-event overhead, preserve event ordering across CPUs, and avoid the per-CPU memory waste of perf buffers.

    * bpf_get_current_cgroup_id(): This is the linchpin of our container awareness. The cgroup ID is a stable identifier for the container's control group. Our userspace agent will use this ID to resolve the actual container name and metadata.

    * bpf_get_current_pid_tgid(): We capture both the thread ID (pid) and the process ID (tgid) from the host's perspective. This is crucial because an attacker might use a short-lived thread to perform the escape.
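
    One practical consequence of the cgroup ID point: the userspace agent needs the host's root cgroup ID to separate host events from container events. A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup, where the ID reported by bpf_get_current_cgroup_id() corresponds to the inode number of the cgroup directory:

    go
    // root_cgroup.go — discover the host root cgroup ID for event filtering.
    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // rootCgroupID returns the host root cgroup's ID. Assumption: cgroup v2
    // is mounted at /sys/fs/cgroup; there, the ID seen by eBPF equals the
    // cgroup directory's inode number.
    func rootCgroupID() (uint64, error) {
        var st unix.Stat_t
        if err := unix.Stat("/sys/fs/cgroup", &st); err != nil {
            return 0, err
        }
        return st.Ino, nil
    }

    func main() {
        id, err := rootCgroupID()
        if err != nil {
            panic(err)
        }
        fmt.Println("host root cgroup id:", id)
    }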

    The Userspace Agent for Correlation and Alerting

    Raw kernel events are just data. The intelligence lies in the userspace agent that consumes, enriches, and analyzes these events. A production-grade agent should be written in a language like Go or Rust for performance and concurrency.

    Here is a simplified Go agent that consumes events from our ring buffer and applies detection logic.

    go
    // main.go
    package main
    
    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "log"
        "os"
        "os/signal"
        "syscall"
    
        "github.com/cilium/ebpf/link"
        "github.com/cilium/ebpf/ringbuf"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang setns_monitor setns_monitor.bpf.c -- -I./headers
    
    const ( 
        // From /usr/include/linux/sched.h
        CLONE_NEWNS   = 0x00020000 /* New mount namespace */
        CLONE_NEWUTS  = 0x04000000 /* New utsname namespace */
        CLONE_NEWIPC  = 0x08000000 /* New ipc namespace */
        CLONE_NEWUSER = 0x10000000 /* New user namespace */
        CLONE_NEWPID  = 0x20000000 /* New pid namespace */
        CLONE_NEWNET  = 0x40000000 /* New network namespace */
    )
    
    func main() {
        // Handle Ctrl+C
        stopper := make(chan os.Signal, 1)
        signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
        // Load BPF objects
        objs := setns_monitorObjects{}
        if err := loadSetns_monitorObjects(&objs, nil); err != nil {
            log.Fatalf("loading objects: %v", err)
        }
        defer objs.Close()
    
        // Attach kprobe
        kp, err := link.Kprobe("do_setns", objs.DoSetns, nil)
        if err != nil {
            log.Fatalf("attaching kprobe: %s", err)
        }
        defer kp.Close()
    
        // Create a ring buffer reader
        rd, err := ringbuf.NewReader(objs.Rb)
        if err != nil {
            log.Fatalf("creating ringbuf reader: %s", err)
        }
        defer rd.Close()
    
        log.Println("Waiting for events... Press Ctrl+C to exit.")
    
        go func() {
            <-stopper
            rd.Close()
        }()
    
        var event setns_monitorSetnsEvent
        for {
            record, err := rd.Read()
            if err != nil {
                if err == ringbuf.ErrClosed {
                    log.Println("Received signal, exiting...")
                    return
                }
                log.Printf("reading from reader: %s", err)
                continue
            }
    
            // Parse the event data
            if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
                log.Printf("parsing ringbuf event: %s", err)
                continue
            }
    
            // *** CORE DETECTION LOGIC ***
            processDetection(event)
        }
    }
    
    func processDetection(event setns_monitorSetnsEvent) {
        // In a real system, this would be a sophisticated policy engine.
        comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
    
        // 1. Get the host root cgroup ID. This is a simplification: on
        //    cgroup v2 the ID is the inode number of the cgroup directory,
        //    so a real agent would stat the cgroup2 mount point (see
        //    rootCgroupID earlier) rather than hardcode a value.
        hostCgroupID := uint64(1) // Placeholder for the root cgroup ID
    
        // 2. Check if the event is from a containerized process
        if event.CgroupId != hostCgroupID {
            // This is a process inside a container.
    
            // 3. Resolve container metadata (this is a critical step)
            // In a real system, you would query the Docker/containerd socket
            // or use a cache mapping cgroup IDs to container names.
            containerName := resolveContainerFromCgroupID(event.CgroupId)
    
            // 4. Implement the detection logic
            // Any setns call from within a container is suspicious. A call to the host
            // PID or mount namespace is a high-severity indicator of an escape.
            nsTypeStr := namespaceTypeToString(int(event.NsType))
            
            // This is where you would generate a detailed security alert.
            alert := fmt.Sprintf(
                "[ALERT] Possible Container Escape Attempt!\n"+
                "  Container: %s (cgroup_id: %d)\n"+
                "  Process: '%s' (PID: %d)\n"+
                "  Action: Attempted to switch to namespace of type '%s'\n",
                containerName, event.CgroupId, comm, event.HostTgid, nsTypeStr,
            )
            log.Println(alert)
    
            // Integrate with SIEM, Prometheus, etc.
        }
    }
    
    // A placeholder for a real implementation that would query the container runtime.
    func resolveContainerFromCgroupID(id uint64) string {
        // Example: query containerd via its socket to find matching container.
        // For this example, we'll just return a placeholder.
        return fmt.Sprintf("container-for-cgroup-%d", id)
    }
    
    func namespaceTypeToString(nstype int) string {
        var nsTypes []string
        if nstype&CLONE_NEWNS != 0 {
            nsTypes = append(nsTypes, "mnt")
        }
        if nstype&CLONE_NEWUTS != 0 {
            nsTypes = append(nsTypes, "uts")
        }
        if nstype&CLONE_NEWIPC != 0 {
            nsTypes = append(nsTypes, "ipc")
        }
        if nstype&CLONE_NEWUSER != 0 {
            nsTypes = append(nsTypes, "user")
        }
        if nstype&CLONE_NEWPID != 0 {
            nsTypes = append(nsTypes, "pid")
        }
        if nstype&CLONE_NEWNET != 0 {
            nsTypes = append(nsTypes, "net")
        }
        if len(nsTypes) == 0 {
            // setns(2) accepts nstype 0, meaning "whatever type fd refers to"
            return "unspecified"
        }
        return fmt.Sprintf("%v", nsTypes)
    }
    

    Edge Cases and Production Considerations

    * False Positives from Orchestrators: Kubernetes (via the kubelet) and other container managers *do* use setns for legitimate operations such as exec probes and attach sessions. A naive alert on every setns call from a containerized cgroup will flood your system. Your policy engine must be sophisticated enough to maintain an allow-list. For example, you can check the process's parent PID (PPID): if the parent is containerd-shim-runc-v2 or kubelet, the call is likely legitimate. This requires enriching the event data in userspace by reading /proc/[pid]/status, as sketched after this list.

    * Cgroup ID Resolution: The resolveContainerFromCgroupID function is critical. On a production Kubernetes node, you would interact with the CRI (Container Runtime Interface) socket (e.g., containerd.sock). You should build a cache mapping cgroup IDs to container metadata (name, image, pod, namespace) to avoid querying the runtime for every single event, which would be a performance bottleneck; the sketch below includes a simple version.

    * Short-Lived Containers: An attacker might spin up a short-lived container to perform the escape. By the time your userspace agent processes the event, the container might be gone. It's crucial that your agent can operate on potentially stale metadata and still flag the event based on the cgroup ID, which persists for a short while.
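
    To ground the three points above, here is a minimal sketch of the userspace enrichment they call for: a PPID-based allow-list check plus a cgroup-to-metadata cache with a grace period for short-lived containers. The allow-list entries, the 30-second grace period, and resolveFromRuntime are illustrative assumptions, not production values.

    go
    // enrich.go — userspace enrichment sketch: PPID allow-list + cgroup cache.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
        "sync"
        "time"
    )

    // parentAllowlisted reads /proc/[pid]/status to find the PPID, then
    // compares the parent's comm against known orchestrator processes.
    func parentAllowlisted(hostPid uint32) bool {
        status, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", hostPid))
        if err != nil {
            return false // process already exited; we cannot vouch for it
        }
        ppid := 0
        for _, line := range strings.Split(string(status), "\n") {
            if strings.HasPrefix(line, "PPid:") {
                ppid, _ = strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "PPid:")))
                break
            }
        }
        if ppid <= 0 {
            return false
        }
        comm, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
        if err != nil {
            return false
        }
        parent := strings.TrimSpace(string(comm))
        for _, allowed := range []string{"containerd-shim", "kubelet"} {
            if strings.HasPrefix(parent, allowed) {
                return true
            }
        }
        return false
    }

    // cgroupCache maps cgroup IDs to container names, keeping entries for a
    // grace period so events from short-lived containers stay attributable.
    type cgroupCache struct {
        mu      sync.Mutex
        entries map[uint64]cacheEntry
    }

    type cacheEntry struct {
        name     string
        resolved time.Time
    }

    func newCgroupCache() *cgroupCache {
        return &cgroupCache{entries: make(map[uint64]cacheEntry)}
    }

    func (c *cgroupCache) resolve(id uint64) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        // Serve from cache for 30s (illustrative) to avoid hammering the CRI.
        if e, ok := c.entries[id]; ok && time.Since(e.resolved) < 30*time.Second {
            return e.name
        }
        name := resolveFromRuntime(id) // hypothetical CRI lookup
        c.entries[id] = cacheEntry{name: name, resolved: time.Now()}
        return name
    }

    // resolveFromRuntime is a stand-in for querying the CRI socket and
    // indexing containers by the inode of their cgroup path.
    func resolveFromRuntime(id uint64) string {
        return fmt.Sprintf("container-for-cgroup-%d", id)
    }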


    Pattern 2: Monitoring Dangerous Capabilities with `tracepoints`

    Linux capabilities are a more granular way to grant root-like privileges. A container launched with --privileged or with a broad set of capabilities like CAP_SYS_ADMIN or CAP_SYS_MODULE is a security risk. The CAP_SYS_MODULE capability, for example, allows a process to load and unload kernel modules—a direct path to compromising the host kernel.

    Our goal is to detect when a process inside a container attempts to use a dangerous capability.
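
    Before writing the probe, it helps to see where this information lives in userspace: the kernel exposes each process's capability sets as hex bitmasks in /proc/[pid]/status. A minimal sketch that checks a process's effective set for CAP_SYS_MODULE (capability number 16):

    go
    // capcheck.go — read CapEff from /proc/[pid]/status and test a bit.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    const capSysModule = 16 // from /usr/include/linux/capability.h

    // hasCapSysModule parses the CapEff line of /proc/[pid]/status and tests
    // whether the effective capability set includes CAP_SYS_MODULE.
    func hasCapSysModule(pid int) (bool, error) {
        status, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
        if err != nil {
            return false, err
        }
        for _, line := range strings.Split(string(status), "\n") {
            if strings.HasPrefix(line, "CapEff:") {
                mask, err := strconv.ParseUint(strings.TrimSpace(strings.TrimPrefix(line, "CapEff:")), 16, 64)
                if err != nil {
                    return false, err
                }
                return mask&(1<<capSysModule) != 0, nil
            }
        }
        return false, fmt.Errorf("CapEff not found")
    }

    func main() {
        ok, err := hasCapSysModule(os.Getpid())
        if err != nil {
            panic(err)
        }
        fmt.Println("CAP_SYS_MODULE effective:", ok)
    }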

    The eBPF Kernel-Side Probe

    Instead of a kprobe, we can use a tracepoint, which is a more stable and efficient way to hook into kernel events. The sys_enter_capset tracepoint fires every time the capset syscall is entered. While we could also probe cap_capable, it's an extremely hot function, and probing it can introduce measurable overhead. capset is less frequent and often indicates an intentional change in privilege.

    Let's create cap_monitor.bpf.c.

    c
    // cap_monitor.bpf.c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    
    // Event for capability checks
    struct cap_event {
        u64 cgroup_id;
        u32 host_pid;
        char comm[16];
        // The capability bitmasks are complex. For simplicity, we can send a signal
        // and let userspace inspect /proc/[pid]/status for the full capability set.
        // Or, for specific capabilities, we can check them in-kernel.
        bool has_sys_module; // Example of a specific check
    };
    
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } cap_rb SEC(".maps");
    
    // Layout of the userspace capability payload (mirrors
    // struct __user_cap_data_struct from linux/capability.h)
    struct cap_user_data_t {
        __u32 effective;
        __u32 permitted;
        __u32 inheritable;
    };
    
    SEC("tracepoint/syscalls/sys_enter_capset")
    int tracepoint__syscalls__sys_enter_capset(struct trace_event_raw_sys_enter *ctx) {
        u64 cgroup_id = bpf_get_current_cgroup_id();
    
        // Filter out host processes immediately
        // This ID needs to be discovered by the userspace agent and pushed
        // into a map, but we'll hardcode for this example.
        if (cgroup_id == 1) {
            return 0;
        }
    
        // args[1] of capset(2) is a *user-space* pointer to the capability
        // data. BPF_CORE_READ is for kernel memory, so we must copy the
        // struct with bpf_probe_read_user() before inspecting it.
        struct cap_user_data_t data = {};
        void *datap = (void *)ctx->args[1];
        if (bpf_probe_read_user(&data, sizeof(data), datap)) {
            return 0;
        }

        // Check for CAP_SYS_MODULE (capability number 16)
        // See /usr/include/linux/capability.h
        bool has_sys_module = data.effective & (1 << 16);
    
        // Only send an event if a dangerous capability is being set.
        if (!has_sys_module) {
            return 0;
        }
    
        struct cap_event *e = bpf_ringbuf_reserve(&cap_rb, sizeof(*e), 0);
        if (!e) {
            return 0;
        }
    
        e->cgroup_id = cgroup_id;
        e->host_pid = bpf_get_current_pid_tgid() >> 32; // host TGID (process ID)
        e->has_sys_module = has_sys_module;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
        bpf_ringbuf_submit(e, 0);
    
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Implementation Details:

    * tracepoint/syscalls/sys_enter_capset: This is a stable API. The arguments for tracepoints are accessed via the context (ctx) pointer, which is different from kprobes.

    * In-Kernel Filtering: Notice the key optimization: if (!has_sys_module) { return 0; }. We perform the check for the dangerous capability *inside the eBPF program*. This dramatically reduces the volume of events sent to userspace, as we only care about high-risk changes. This is a critical pattern for managing performance.

    * Argument Reading: We read the syscall arguments from ctx->args. args[1] corresponds to the const cap_user_data_t __user *data argument of the capset syscall. Because this is a user-space pointer, we copy the structure with bpf_probe_read_user before inspecting the capability bitmask.

    Userspace Policy Enforcement

    The userspace agent for this monitor would be similar to the setns one but with a different policy engine.

    go
    // In the userspace agent:
    func processCapEvent(event cap_monitorCapEvent) {
        containerName := resolveContainerFromCgroupID(event.CgroupId)
        comm := string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)])
    
        // The policy can be based on image name, namespace, etc.
        // For example, a database container should never need CAP_SYS_MODULE.
        isAllowed := checkCapabilityPolicy(containerName, comm, "CAP_SYS_MODULE")
    
        if !isAllowed {
            alert := fmt.Sprintf(
                "[CRITICAL] Dangerous Capability Usage Detected!\n"+
                "  Container: %s\n"+
                "  Process: '%s' (PID: %d)\n"+
                "  Action: Attempted to set effective capability 'CAP_SYS_MODULE'. Possible rootkit installation attempt.\n",
                containerName, comm, event.HostPid,
            )
            log.Println(alert)
            // Trigger a high-priority alert to a SIEM or on-call security team.
        }
    }
    
    // Placeholder for a real policy engine that might read from a config file or API.
    func checkCapabilityPolicy(container, process, capability string) bool {
        // A real policy engine would be much more complex.
        // Example: Allowlist for a specific trusted debugging container.
        if strings.HasPrefix(container, "trusted-debug-tools") {
            return true
        }
        return false
    }

    This approach turns runtime security into a policy-driven system. You can define a baseline of expected behavior for your workloads and alert on any deviation, a core principle of zero-trust security.
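
    As a sketch of what policy-driven can look like in practice, the allow-list can be lifted out of code into data. The rule shape below is an illustrative assumption, not a standard format:

    go
    // policy.go — a minimal declarative capability policy (illustrative shape).
    package main

    import "strings"

    // CapabilityRule allows a named capability for workloads whose container
    // name matches a prefix. A real engine would also match on image, pod
    // labels, and Kubernetes namespace.
    type CapabilityRule struct {
        ContainerPrefix string
        Capability      string
    }

    type Policy struct {
        Allow []CapabilityRule
    }

    // Allowed reports whether the policy permits this container to use the
    // given capability; anything not explicitly allowed is alertable.
    func (p *Policy) Allowed(container, capability string) bool {
        for _, r := range p.Allow {
            if r.Capability == capability && strings.HasPrefix(container, r.ContainerPrefix) {
                return true
            }
        }
        return false
    }

    // Example baseline: only a trusted debug image may set CAP_SYS_MODULE.
    var defaultPolicy = Policy{
        Allow: []CapabilityRule{
            {ContainerPrefix: "trusted-debug-tools", Capability: "CAP_SYS_MODULE"},
        },
    }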


    Productionizing and Scaling the eBPF Monitor

    Writing the eBPF programs is only half the battle. Deploying and managing them across a fleet of machines requires careful engineering.

  • Deployment as a DaemonSet: The userspace agent and its corresponding eBPF programs should be packaged into a container and deployed as a Kubernetes DaemonSet. This ensures the monitor runs on every node in the cluster. The container needs to run with sufficient privileges to load eBPF programs, typically requiring privileged: true or specific capabilities like CAP_BPF and CAP_PERFMON.
  • CO-RE is Non-Negotiable: As mentioned, using libbpf with BTF support is essential. Your build pipeline must include a step to generate vmlinux.h using bpftool. This decouples your eBPF program from the specific kernel version of your nodes, saving you from the nightmare of recompiling for every kernel update.
    A snippet from a Makefile might look like this:

    makefile
        # Generate vmlinux.h from the running kernel's BTF, then build the
        # BPF skeletons from the C code via go generate.
        BPF_SRC = setns_monitor.bpf.c
        GO_SRC  = setns_monitor.go

        all: $(GO_SRC)

        vmlinux.h:
        	bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

        $(GO_SRC): $(BPF_SRC) vmlinux.h
        	go generate

        # The go:generate directive in main.go invokes bpf2go, which handles
        # the clang compilation and skeleton generation.
  • Alerting and Telemetry: The Go agent should not just log to stdout. It must integrate with your central observability stack.

    * Metrics: Expose Prometheus metrics for the number of events processed, alerts fired, and errors encountered (ebpf_events_total, ebpf_alerts_fired_total). A minimal sketch of this wiring follows this list.

    * Structured Logs/Alerts: Format alerts as JSON and forward them to a logging aggregator like Fluentd or a SIEM. The JSON payload must include rich context: pod name, Kubernetes namespace, image name, node name, and the full event details.

  • Performance Tuning and Safety:

    * Resource Limits: The DaemonSet Pod should have CPU and memory limits defined. While eBPF is efficient, the userspace agent can consume resources, especially if its cache of cgroup IDs grows large.

    * BPF Verifier: Trust the eBPF verifier. It statically analyzes your eBPF code before loading to prevent infinite loops, out-of-bounds memory access, and other unsafe operations that could crash the kernel. This is the fundamental safety guarantee that makes eBPF viable for production security.

    * Graceful Shutdown: The userspace agent must handle termination signals (SIGINT, SIGTERM) gracefully, ensuring it detaches its eBPF probes and closes all maps to clean up kernel resources properly. The defer kp.Close() pattern shown in the Go example is crucial.
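
    A minimal sketch of the metrics wiring mentioned in the Alerting and Telemetry bullet, using the Prometheus Go client (the metric names match the examples above; the listen address is an assumption):

    go
    // metrics.go — Prometheus instrumentation for the agent (minimal sketch).
    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        eventsTotal = promauto.NewCounter(prometheus.CounterOpts{
            Name: "ebpf_events_total",
            Help: "Total eBPF events read from the ring buffer.",
        })
        alertsFired = promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "ebpf_alerts_fired_total",
            Help: "Security alerts fired, labeled by detector.",
        }, []string{"detector"})
    )

    // serveMetrics exposes /metrics; run it from main as a goroutine.
    func serveMetrics() {
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9102", nil))
    }

    // In the event loop:
    //   eventsTotal.Inc()
    //   alertsFired.WithLabelValues("setns_monitor").Inc()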

    Tying it Together: A Multi-Stage Attack Scenario

    Imagine an attacker exploits a remote code execution vulnerability in a web application running in a pod. The pod was mistakenly configured with CAP_SYS_ADMIN.

  • Initial Foothold: The attacker gets a shell inside the webapp container.

  • Escape Attempt: The attacker knows CAP_SYS_ADMIN allows them to use setns. They write a small C program or use a script to call setns to join the host's mount namespace.

    * DETECTION: Our setns_monitor eBPF program instantly captures this. The userspace agent receives the event, sees the cgroup_id of the webapp pod, and fires a high-severity alert: [ALERT] Process 'exploit' in container 'webapp-pod-xyz' attempted to switch to host mount namespace.

  • Persistence on Host: If the alert is missed, the attacker, now in the host's mount namespace, can write a systemd service or a cron job to /etc/systemd/system or /etc/cron.d to establish persistence.

    * FURTHER DETECTION: A separate file integrity monitoring (FIM) eBPF program (which could be built on similar principles to watch for writes to sensitive files) would detect this and fire another alert, correlated to the same suspicious process.

    This demonstrates how eBPF provides the low-level, context-rich signals necessary to detect the individual steps of a complex attack chain in real-time.

    Conclusion

    eBPF is not a silver bullet, but it provides a level of runtime visibility that was previously impossible without intrusive agents or unstable kernel modules. By attaching lightweight, secure probes to strategic syscalls and kernel functions, we can build a powerful, low-overhead security monitoring system capable of detecting sophisticated container escape techniques.

    The patterns discussed here—monitoring setns for namespace manipulation and capset for privilege escalation—are just the beginning. The same approach can be extended to monitor file access (openat), network connections (tcp_connect), process execution (execve), and more. For senior engineers responsible for securing large-scale container platforms, mastering eBPF is no longer optional; it's a critical skill for building the next generation of runtime security defenses.
