Advanced K8s Runtime Security: eBPF Syscall Hooking for Intrusion Detection

Goh Ling Yong

The Observability Gap in Ephemeral Kubernetes Workloads

In modern Kubernetes environments, traditional security monitoring tools often fall short. Host-based intrusion detection systems (HIDS) relying on auditd generate overwhelming log volumes and introduce significant performance overhead. Sidecar proxies, while effective for network policy, create a blind spot for host-level and kernel-level exploits. Static container scanning is crucial but offers zero visibility into runtime behavior. What's needed is a mechanism that provides deep, real-time visibility into process behavior at the kernel level, with minimal performance impact, and is context-aware of Kubernetes primitives.

This is where the Extended Berkeley Packet Filter (eBPF) becomes a game-changer. For senior engineers, eBPF isn't magic; it's a kernel-level execution sandbox that allows us to attach custom, event-driven programs to various kernel hooks. Its efficiency stems from a JIT compiler that translates eBPF bytecode to native machine code and a stringent verifier that ensures memory safety and termination, preventing kernel panics. This allows for security instrumentation that is both powerful and safe.

This article isn't an introduction to eBPF. We assume you understand the basics of probes, maps, and the general architecture. Instead, we will focus on a specific, powerful implementation pattern: hooking the raw_syscalls:sys_enter tracepoint to build a sophisticated, Kubernetes-aware runtime security monitor from scratch.

Core Architecture: Syscall Hooking with `sys_enter`

The tracepoint:raw_syscalls:sys_enter tracepoint is our entry point. It fires for every system call on the host and provides a context containing the syscall ID and its six arguments (as captured from the registers at syscall entry). This is a double-edged sword: it offers complete visibility but also presents a significant performance challenge if not handled correctly.
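
If you want to see exactly what this tracepoint exposes, you can dump its format from tracefs. A quick check, assuming debugfs/tracefs is mounted at the conventional path:

python
# Requires root and a mounted debugfs/tracefs.
with open("/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format") as f:
    print(f.read())  # fields include: long id; unsigned long args[6];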

Our architecture will be as follows:

  • eBPF Kernel Program: Written in C, this program will be attached to the sys_enter tracepoint.
    * It will identify the source container using bpf_get_current_cgroup_id().
    * It will filter syscalls based on their ID (e.g., __NR_execve, __NR_openat).
    * For matched syscalls, it will parse arguments directly from the registers.
    * It will push relevant event data into a BPF_PERF_OUTPUT map for userspace consumption.

  • Userspace Controller: Written in Python using the BCC (BPF Compiler Collection) framework for rapid prototyping.
    * It loads and attaches the eBPF program.
    * It polls the perf buffer for events from the kernel.
    * It enriches the low-level kernel event with high-level Kubernetes metadata (Pod name, namespace, container image) by querying the container runtime or Kubernetes API.
    * It applies logic to the enriched event to determine if it constitutes a security alert.

The whole stack will be deployed as a DaemonSet to ensure our monitor runs on every node in the cluster.

    Why not `kprobes` on specific `sys_` functions?

    While you could place a kprobe on sys_execve, using the sys_enter tracepoint offers a more stable and portable API. Kernel function names can change, but the tracepoint ABI is designed for stability. Furthermore, handling multiple syscalls from a single, centralized eBPF program is more efficient than attaching dozens of individual kprobes.
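
    For reference, the kprobe-based alternative in BCC looks roughly like this (the trace_execve handler name is illustrative); note that the syscall's kernel symbol has to be resolved per architecture and kernel build, which is exactly the fragility the tracepoint avoids:

    python
    from bcc import BPF

    prog = """
    #include <uapi/linux/ptrace.h>

    int trace_execve(struct pt_regs *ctx) {
        bpf_trace_printk("execve called\\n");
        return 0;
    }
    """

    b = BPF(text=prog)
    # get_syscall_fnname() resolves the architecture-specific wrapper symbol,
    # e.g. __x64_sys_execve on x86_64.
    b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")
    b.trace_print()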

    Scenario 1: Detecting Unexpected Shell Execution in a Container

    A common attack vector is gaining execution within a container and spawning a shell (/bin/sh, /bin/bash) to explore the environment or escalate privileges. Our first goal is to detect this behavior.

    The eBPF Kernel Program

    We'll start with the C code for our eBPF program. This program will be compiled at runtime by BCC.

    c
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>
    #include <linux/fs.h>
    
    // Define the data structure for events sent to userspace
    struct event_t {
        u64 cgroup_id;
        u32 pid;
        int syscall_id;
        char comm[TASK_COMM_LEN];
        char filename[256]; // DNAME_INLINE_LEN is common, but let's use a larger fixed size
    };
    
    // BPF_PERF_OUTPUT map to send events to userspace
    BPF_PERF_OUTPUT(events);
    
    // The main eBPF program attached to the tracepoint
    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        // We are only interested in the execve syscall
        if (args->id != __NR_execve) {
            return 0;
        }
    
        // Get process and container identifiers
        u64 cgroup_id = bpf_get_current_cgroup_id();
        u32 pid = bpf_get_current_pid_tgid() >> 32;
    
        // Create an event structure to populate
        struct event_t event = {};
        event.cgroup_id = cgroup_id;
        event.pid = pid;
        event.syscall_id = args->id;
    
        // Get the command name
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
    
        // Read the filename argument from the first register (args->args[0])
        // bpf_probe_read_user_str is a helper to safely read a string from user space memory
        bpf_probe_read_user_str(&event.filename, sizeof(event.filename), (void *)args->args[0]);
    
        // Submit the event to the perf buffer for userspace to read
        events.perf_submit(args, &event, sizeof(event));
    
        return 0;
    }

    Key Implementation Details:

    * bpf_get_current_cgroup_id(): This is our link to the container world. The Cgroup ID is the key we'll use in userspace to map an event back to a specific container and pod.

    * bpf_probe_read_user_str(): This is a critical BPF helper. It safely copies a null-terminated string from the user-space memory of the process making the syscall into our eBPF program's stack. It's safe because the verifier ensures the read is bounded, preventing kernel memory corruption.

    * BPF_PERF_OUTPUT: This declares a perf buffer map named events. The perf_submit call pushes our populated event_t struct into this high-speed, memory-mapped buffer for efficient kernel-to-userspace communication.

    The Python Userspace Controller

    Now, let's write the Python script that loads this program and processes its events. This requires the BCC Python bindings (packaged as python3-bcc or python3-bpfcc, depending on the distribution).

    python
    #!/usr/bin/python3
    
    from bcc import BPF
    import ctypes as ct
    import re
    import docker
    import logging
    
    # Configure logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    
    # C program from above
    bpf_text = """
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>
    #include <linux/fs.h>
    
    struct event_t {
        u64 cgroup_id;
        u32 pid;
        int syscall_id;
        char comm[TASK_COMM_LEN];
        char filename[256];
    };
    
    BPF_PERF_OUTPUT(events);
    
    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        if (args->id != __NR_execve) {
            return 0;
        }
    
        u64 cgroup_id = bpf_get_current_cgroup_id();
        u32 pid = bpf_get_current_pid_tgid() >> 32;
    
        // Optimization: Filter out host processes early if possible.
        // In a real system, you might have a map of known container cgroup IDs.
        // For this example, we'll do all filtering in userspace.
    
        struct event_t event = {};
        event.cgroup_id = cgroup_id;
        event.pid = pid;
        event.syscall_id = args->id;
    
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
        bpf_probe_read_user_str(&event.filename, sizeof(event.filename), (void *)args->args[0]);
    
        events.perf_submit(args, &event, sizeof(event));
    
        return 0;
    }
    """
    
    # Define the Python data structure to match the C struct
    class Event(ct.Structure):
        _fields_ = [
            ("cgroup_id", ct.c_ulonglong),
            ("pid", ct.c_uint),
            ("syscall_id", ct.c_int),
            ("comm", ct.c_char * 16), # TASK_COMM_LEN
            ("filename", ct.c_char * 256),
        ]
    
    # A cache to map cgroup_id to container info to avoid constant Docker API calls
    cgroup_cache = {}
    
    # Initialize Docker client
    try:
        client = docker.from_env()
    except Exception as e:
        logging.error(f"Could not connect to Docker daemon: {e}")
        exit(1)
    
    def get_container_info(cgroup_id, pid):
        """Enrich a cgroup ID with container metadata.

        Best-effort lookup: parse /proc/<PID>/cgroup for a 64-character container ID
        and ask the Docker API for the container's name and image. Cgroup path layouts
        differ between runtimes and cgroup v1/v2, so we fall back to placeholder values
        when nothing matches. A production agent would resolve Pod metadata via the
        Kubernetes API instead (see "Production Refinement" below).
        """
        if cgroup_id in cgroup_cache:
            return cgroup_cache[cgroup_id]

        info = {"name": "unknown_container", "image": "unknown_image", "pod": "unknown_pod"}

        try:
            # Container IDs appear as 64-char hex strings in the cgroup path, e.g.
            # ".../docker/<id>" (cgroup v1) or ".../docker-<id>.scope" (cgroup v2).
            with open(f"/proc/{pid}/cgroup") as f:
                match = re.search(r"[0-9a-f]{64}", f.read())
            if match:
                container = client.containers.get(match.group(0))
                info["name"] = container.name
                info["image"] = container.image.tags[0] if container.image.tags else "untagged"
        except Exception as e:
            # The process may have exited already, or the ID may belong to a
            # non-Docker runtime; keep the placeholder values in that case.
            logging.warning(f"Could not fetch container info: {e}")

        cgroup_cache[cgroup_id] = info
        return info
    
    # Callback function to process events from the perf buffer
    def print_event(cpu, data, size):
        event = ct.cast(data, ct.POINTER(Event)).contents
    
        filename = event.filename.decode('utf-8', 'replace')
        comm = event.comm.decode('utf-8', 'replace')
    
        # The core security logic
        suspicious_shells = ["/bin/sh", "/bin/bash", "/bin/zsh", "/bin/ash"]
        if filename in suspicious_shells:
            container_info = get_container_info(event.cgroup_id, event.pid)
            logging.warning(
                f"[ALERT] Suspicious Shell Execution Detected!\n" \
                f"  Pod: {container_info['pod']} | Container: {container_info['name']} | Image: {container_info['image']}\n" \
                f"  PID: {event.pid} | Comm: {comm}\n" \
                f"  Syscall: execve | Executed: {filename}"
            )
        else:
            logging.info(f"Execve event: PID={event.pid}, Comm={comm}, Filename={filename}")
    
    # Main execution
    if __name__ == "__main__":
        logging.info("Attaching eBPF program to syscall tracepoint...")
        b = BPF(text=bpf_text)
        logging.info("eBPF program attached. Waiting for events...")
    
        # Open the perf buffer and set the callback function
        b["events"].open_perf_buffer(print_event)
    
        # Loop to poll the perf buffer
        while True:
            try:
                b.perf_buffer_poll()
            except KeyboardInterrupt:
                exit()

    To test this, run the Python script with sudo. Then, in another terminal, execute a shell inside any running container:

    docker exec -it <container_id> /bin/sh

    You will see an alert fire from the Python script.

    Production Refinement: Kubernetes Metadata Enrichment

    The get_container_info function above resolves only container-runtime metadata and knows nothing about Pods. In a production Kubernetes environment, you would not use the Docker client at all. Instead, the DaemonSet's pod would be given a ServiceAccount with read-only permissions to the Kubernetes API. The enrichment process would be:

  • The eBPF program emits an event with a cgroup_id and PID.
  • The userspace controller reads the cgroup path from /proc/<PID>/cgroup.
  • The container ID and Pod UID are parsed from the cgroup path (e.g., /kubepods/burstable/pod<pod-uid>/<container-id>).
  • With the Pod UID and Container ID, query the Kubernetes API server (or a local cache populated by an Informer) to get the Pod name, namespace, labels, and other relevant metadata.

    This provides the necessary context to determine if the shell execution is anomalous for a pod running, for example, a Redis image versus a general-purpose debugging pod.
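
    A sketch of that enrichment flow follows. It assumes the agent runs with hostPID (so it can read /proc/<PID>/cgroup for any process on the node), uses the official kubernetes Python client with an in-cluster ServiceAccount, and that the node's cgroup paths follow the common kubepods layout; exact path formats vary across container runtimes and cgroup v1/v2, so treat the regexes as illustrative.

    python
    import re
    from kubernetes import client, config

    # In-cluster configuration: uses the DaemonSet pod's ServiceAccount token.
    config.load_incluster_config()
    v1 = client.CoreV1Api()

    # Pod UIDs and container IDs are embedded in the cgroup path, e.g.
    # "/kubepods/burstable/pod<uid>/<container-id>" (cgroupfs driver) or
    # ".../kubepods-burstable-pod<uid>.slice/cri-containerd-<cid>.scope" (systemd driver).
    POD_UID_RE = re.compile(r"pod([0-9a-f_-]{36})")
    CONTAINER_ID_RE = re.compile(r"([0-9a-f]{64})")

    def enrich(pid):
        """Map a host PID to Pod metadata via its cgroup path and the Kubernetes API."""
        with open(f"/proc/{pid}/cgroup") as f:
            cgroup_data = f.read()

        pod_uid_match = POD_UID_RE.search(cgroup_data)
        if not pod_uid_match:
            return None  # host process, not part of a Pod

        # The systemd cgroup driver encodes the UID with underscores instead of dashes.
        pod_uid = pod_uid_match.group(1).replace("_", "-")
        container_id_match = CONTAINER_ID_RE.search(cgroup_data)

        # In production, replace this list call with a local cache kept up to date by an
        # Informer/watch, keyed by Pod UID, to avoid hitting the API server on every event.
        for pod in v1.list_pod_for_all_namespaces().items:
            if pod.metadata.uid == pod_uid:
                return {
                    "pod": pod.metadata.name,
                    "namespace": pod.metadata.namespace,
                    "labels": pod.metadata.labels,
                    "container_id": container_id_match.group(1) if container_id_match else None,
                }
        return None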

    Scenario 2: Detecting Sensitive File Access

    Another critical detection is unauthorized access to sensitive files, such as Kubernetes service account tokens, /etc/shadow, or the Docker socket.

    We can extend our eBPF program to also hook the openat syscall, which is used to open files.

    Modified eBPF Program

    c
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>
    #include <linux/fs.h>
    
    // x86_64 syscall numbers; these differ on other architectures (e.g. arm64)
    #define TARGET_SYSCALL_EXECVE 59
    #define TARGET_SYSCALL_OPENAT 257
    
    struct event_t {
        u64 cgroup_id;
        u32 pid;
        int syscall_id;
        char comm[TASK_COMM_LEN];
        char filename[256];
    };
    
    BPF_PERF_OUTPUT(events);
    
    // Use a helper function to reduce code duplication
    static inline int handle_syscall(struct tracepoint__raw_syscalls__sys_enter *args) {
        u64 id = args->id;
    
        if (id != TARGET_SYSCALL_EXECVE && id != TARGET_SYSCALL_OPENAT) {
            return 0;
        }
    
        u64 cgroup_id = bpf_get_current_cgroup_id();
        u32 pid = bpf_get_current_pid_tgid() >> 32;
    
        struct event_t event = {};
        event.cgroup_id = cgroup_id;
        event.pid = pid;
        event.syscall_id = id;
    
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
    
        // For both execve and openat, the filename is the first argument
        bpf_probe_read_user_str(&event.filename, sizeof(event.filename), (void *)args->args[0]);
    
        events.perf_submit(args, &event, sizeof(event));
        return 0;
    }
    
    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        return handle_syscall(args);
    }

    Modified Userspace Logic

    The Python controller would now need to handle both syscalls.

    python
    # ... (imports and class definitions as before)
    
    def print_event(cpu, data, size):
        event = ct.cast(data, ct.POINTER(Event)).contents
    
        filename = event.filename.decode('utf-8', 'replace')
        comm = event.comm.decode('utf-8', 'replace')
    
        # --- execve logic ---
        if event.syscall_id == 59: # __NR_execve
            suspicious_shells = ["/bin/sh", "/bin/bash"]
            if filename in suspicious_shells:
                # ... (alerting logic as before)
                logging.warning(f"[ALERT] Suspicious Shell: {filename} by {comm} (PID {event.pid})")
    
        # --- openat logic ---
        elif event.syscall_id == 257: # __NR_openat
            sensitive_files = [
                "/etc/shadow",
                "/etc/passwd",
                "/var/run/docker.sock",
            ]
            sensitive_prefixes = [
                "/root/.ssh/",
                "/var/run/secrets/kubernetes.io/serviceaccount/"
            ]
    
            is_sensitive = False
            if filename in sensitive_files:
                is_sensitive = True
            for prefix in sensitive_prefixes:
                if filename.startswith(prefix):
                    is_sensitive = True
                    break
            
            if is_sensitive:
                # ... (alerting logic with K8s enrichment)
                logging.critical(
                    f"[CRITICAL ALERT] Sensitive File Access Detected!\n" \
                    f"  PID: {event.pid} | Comm: {comm}\n" \
                    f"  Syscall: openat | File: {filename}"
                )
    
    # ... (main execution loop as before, loading the new BPF text)

    Now, if you run docker exec -it <container_id> cat /etc/shadow, the monitor will immediately fire a critical alert.

    Advanced Considerations and Production Hardening

    While the above examples work, moving them to a high-throughput production cluster requires addressing several advanced topics.

    1. Performance and Overhead

    Attaching to sys_enter means our eBPF program runs for every syscall. The key to performance is to exit as early as possible for irrelevant events.

    * In-kernel Filtering: The if (id != ...) check is crucial. This is executed in nanoseconds. Avoid sending all syscalls to userspace for filtering; the cost of the perf buffer copy and userspace processing is orders of magnitude higher.

    * BPF Maps for Configuration: Instead of hardcoding syscall IDs, a production system would use a BPF_HASH or BPF_ARRAY map, populated from userspace, to store the set of syscalls to monitor. This allows dynamic policy updates without reloading the entire eBPF program (a sketch follows this list).

    * String Comparison in Kernel: In the openat example, we deferred string matching to userspace. Performing this in the kernel is possible but tricky. The eBPF verifier heavily restricts loops to prevent unbounded execution. For prefix matching, you can use bpf_probe_read_user_str with a small size and then perform a bounded loop comparison. However, complex regex matching is not feasible and belongs in userspace.
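
    A minimal BCC sketch of the map-driven filter described above (the monitored_syscalls map name and the loader snippet are illustrative, not part of the earlier program):

    python
    from bcc import BPF

    bpf_text = """
    BPF_HASH(monitored_syscalls, u64, u8);

    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        u64 id = args->id;

        // Bail out in-kernel unless userspace has marked this syscall ID as interesting.
        u8 *enabled = monitored_syscalls.lookup(&id);
        if (!enabled) {
            return 0;
        }

        // ... build the event struct and perf_submit() it, exactly as before ...
        return 0;
    }
    """

    b = BPF(text=bpf_text)
    syscall_filter = b["monitored_syscalls"]

    # Enable execve (59) and openat (257) on x86_64. Updating this map at runtime
    # changes the monitoring policy without detaching or recompiling the program.
    for syscall_id in (59, 257):
        syscall_filter[syscall_filter.Key(syscall_id)] = syscall_filter.Leaf(1)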

    2. The CO-RE (Compile Once - Run Everywhere) Imperative

    BCC is excellent for development, but it has a major production drawback: it compiles the C code on every host it runs on, introducing dependencies on kernel headers (kernel-devel packages) and the Clang/LLVM toolchain. This makes container images bulky and creates a deployment dependency nightmare.

    libbpf + CO-RE is the modern solution. The workflow is:

  • The eBPF C code is compiled into a lightweight object file ahead of time.
  • This compilation includes BTF (BPF Type Format) information, a form of debugging info that describes kernel data structures.
  • At load time, a libbpf-based loader (written in Go, Rust, or C++) uses the BTF info to perform runtime relocations, adjusting the eBPF program to match the specific kernel version it's running on.
  • This eliminates the need for kernel headers and runtime compilation on production hosts, resulting in a small, self-contained binary that is far more portable and robust. Projects like Cilium, Tetragon, and Falco have all moved to this model.

    3. The Verifier Gauntlet: Living with Complexity Constraints

    The eBPF verifier is what makes eBPF safe, but it's also the biggest hurdle for developers. It statically analyzes your code to prove it will always terminate and will never access memory unsafely.

    * Bounded Loops: The verifier must be able to determine a finite upper bound on the number of loop iterations. A for (int i = 0; i < 256; i++) is usually acceptable if the verifier can prove i isn't modified in complex ways. Unbounded while loops are forbidden.

    * Stack Size Limit: The eBPF stack is limited to 512 bytes. Our struct event_t is already taking up a significant portion of this. Avoid large local variables; use BPF_PERCPU_ARRAY maps for scratch space if needed (see the scratch-space sketch after this list).

    * Path Complexity: The verifier analyzes all possible execution paths. Too many branches (if/else) can lead to a complexity limit exceeded error. Refactor code into static inline helper functions (which are often inlined by the compiler) to manage complexity.
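
    To illustrate the stack-size point, here is a sketch that stages the event in a single-slot BPF_PERCPU_ARRAY instead of a stack variable; it reuses the event_t layout from earlier and is otherwise a stripped-down version of the Scenario 1 program:

    python
    from bcc import BPF

    bpf_text = """
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>

    struct event_t {
        u64 cgroup_id;
        u32 pid;
        int syscall_id;
        char comm[TASK_COMM_LEN];
        char filename[256];
    };

    BPF_PERF_OUTPUT(events);

    // The 512-byte eBPF stack is tight for a ~288-byte event_t, so keep the
    // working copy in per-CPU scratch space rather than a local variable.
    BPF_PERCPU_ARRAY(event_scratch, struct event_t, 1);

    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        if (args->id != 59) {   // __NR_execve on x86_64
            return 0;
        }

        int zero = 0;
        struct event_t *event = event_scratch.lookup(&zero);
        if (!event) {
            return 0;
        }

        event->cgroup_id = bpf_get_current_cgroup_id();
        event->pid = bpf_get_current_pid_tgid() >> 32;
        event->syscall_id = args->id;
        bpf_get_current_comm(&event->comm, sizeof(event->comm));
        bpf_probe_read_user_str(&event->filename, sizeof(event->filename),
                                (void *)args->args[0]);

        events.perf_submit(args, event, sizeof(*event));
        return 0;
    }
    """

    # Loading the program is enough to exercise the verifier against this layout.
    b = BPF(text=bpf_text)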

    4. Deployment as a Kubernetes DaemonSet

    The only sane way to deploy this monitor is as a DaemonSet. The Pod spec requires elevated privileges:

    yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ebpf-security-monitor
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: ebpf-security-monitor
      template:
        metadata:
          labels:
            name: ebpf-security-monitor
        spec:
          hostPID: true
          hostNetwork: true
          tolerations:
          - operator: Exists
          containers:
          - name: monitor
            image: my-registry/ebpf-monitor:latest
            securityContext:
              privileged: true # Required for loading BPF programs
            volumeMounts:
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: debugfs
              mountPath: /sys/kernel/debug
          volumes:
          - name: sys
            hostPath:
              path: /sys
          - name: debugfs
            hostPath:
              path: /sys/kernel/debug

    * privileged: true is the easiest way to get the necessary capabilities (CAP_SYS_ADMIN, CAP_BPF, CAP_PERFMON). In a hardened environment, you would drop privileged: true and specify the exact capabilities needed.

    * hostPID: true allows processes in the pod to see all PIDs on the host, which is necessary for enriching events with process information.

    * Mounting /sys exposes the cgroup hierarchy and the BPF filesystem (/sys/fs/bpf), and the /sys/kernel/debug mount provides the tracefs entries BCC needs to attach tracepoints.

    Conclusion: The Foundation of Modern Cloud-Native Security

    By hooking sys_enter with eBPF, we've built the foundational component of a sophisticated cloud-native runtime security tool. We've demonstrated how to gain kernel-level visibility into containerized workloads, detect common attack patterns in real-time, and do so with performance characteristics that are simply unattainable with older technologies.

    This pattern is not just a theoretical exercise; it's the core engine behind leading open-source projects like Falco, Tetragon, and the security observability features in Cilium. While building a complete, production-grade system requires solving challenges around policy management, alert correlation, and robust metadata enrichment at scale, understanding this fundamental eBPF-based syscall monitoring pattern is essential for any senior engineer working on the security or observability of modern infrastructure.
