Kernel-Level Container Security with eBPF for Anomaly Detection
Beyond the Sidecar: Runtime Anomaly Detection with eBPF
In modern cloud-native environments, container security is paramount. Static image scanning and network policies are foundational, but they fail to address runtime threats—malicious processes executed or network connections initiated after a container is already running. Traditional runtime security tools often rely on ptrace, LD_PRELOAD hooking, or network-level sidecar proxies. These approaches, while functional, introduce significant performance overhead, increase attack surface, or lack the complete visibility needed to detect sophisticated attacks.
Enter eBPF (extended Berkeley Packet Filter). eBPF allows us to run sandboxed programs directly within the Linux kernel, triggered by events like syscalls, function entries/exits, or network packets. For security engineering, this is a paradigm shift. We can achieve unparalleled visibility into container behavior with minimal performance impact, all without modifying the application code or container image.
This article is not an introduction to eBPF. It assumes you understand the basic concepts of eBPF programs, maps, and the loader/userspace-controller architecture. We will dive straight into building a practical, production-oriented runtime security monitor that detects two common indicators of compromise:
1. Unauthorized process execution inside a container (unexpected execve calls).
2. Suspicious outbound network connections (e.g., to known command-and-control addresses).
We will implement this using a CO-RE (Compile Once - Run Everywhere) approach with libbpf and a Go-based userspace controller, ensuring our solution is portable across different kernel versions.
The Architecture: Kernel Hooks and a Userspace Brain
Our security monitor consists of two main components:
1. eBPF programs (kernel space), attached to kernel functions via kprobes. These programs collect event data (e.g., the filename for execve, the destination IP for connect) and send it to userspace.
2. A Go controller (userspace) that loads and attaches the eBPF programs, reads the event stream, and applies the security policy.
We'll use a BPF_MAP_TYPE_PERF_EVENT_ARRAY to stream data from kernel to userspace. This is a highly efficient, per-CPU mechanism for handling a high volume of events.
Let's start by setting up our project structure.
.
├── go.mod
├── go.sum
├── main.go # Go userspace controller
├── bpf/ # Directory for eBPF C code
│ ├── bpf_helpers.h # Helper header
│ ├── monitor.bpf.c # Our eBPF program
│ └── vmlinux.h # Kernel type definitions for CO-RE
└── Makefile # To compile the eBPF C code
To generate vmlinux.h for CO-RE, you'll need bpftool:
bpftool btf dump file /sys/kernel/btf/vmlinux format c > bpf/vmlinux.h
Our Makefile will use clang to compile the C code into an eBPF object file.
# Makefile
CLANG ?= clang
GO_APP := monitor
EBPF_SRC := ./bpf/monitor.bpf.c
EBPF_OBJ := ./bpf/monitor.bpf.o
.PHONY: all clean
all: $(GO_APP)
$(EBPF_OBJ): $(EBPF_SRC)
$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_x86 \
-I./bpf \
-c $(EBPF_SRC) -o $(EBPF_OBJ)
$(GO_APP): main.go $(EBPF_OBJ)
go build -o $(GO_APP) main.go
clean:
rm -f $(GO_APP) $(EBPF_OBJ)
Phase 1: Detecting Unauthorized Process Execution
Our first goal is to trace every execve syscall across the system, capture the filename being executed, and send it to our Go application for analysis.
The eBPF Program (`monitor.bpf.c`)
We'll attach a kprobe to the sys_execve syscall. When triggered, our BPF program will read the filename argument and submit it to a perf event buffer.
// bpf/monitor.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 256
// Event structure sent to userspace
struct exec_event {
u32 pid;
u32 ppid;
u64 cgroup_id;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
};
// Perf event map to send data to userspace
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} exec_events SEC(".maps");
SEC("kprobe/sys_execve")
int BPF_KPROBE(handle_execve, const char __user *filename)
{
struct exec_event event = {};
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32;
// Get parent PID
struct task_struct *task = (struct task_struct*)bpf_get_current_task();
event.ppid = BPF_CORE_READ(task, real_parent, tgid);
event.pid = pid;
event.cgroup_id = bpf_get_current_cgroup_id();
bpf_get_current_comm(&event.comm, sizeof(event.comm));
bpf_probe_read_user_str(&event.filename, sizeof(event.filename), filename);
// Submit the event to the perf buffer
bpf_perf_event_output(ctx, &exec_events, BPF_F_CURRENT_CPU, &event, sizeof(event));
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Advanced Implementation Details:
* vmlinux.h and BPF_CORE_READ: We are using BTF (BPF Type Format) and CO-RE. Instead of including dozens of kernel headers, vmlinux.h provides all kernel type definitions. BPF_CORE_READ is a macro that allows safe, portable access to kernel struct fields. This makes our program resilient to changes in kernel data structures across different Linux versions.
* bpf_get_current_cgroup_id(): Relying on PID alone is insufficient in a containerized environment due to PID namespace virtualization and PID wrapping. The cgroup ID is a much more stable identifier for a container. We will use this in userspace to associate events with specific containers.
* bpf_probe_read_user_str(): This helper safely copies the filename from userspace memory (where the syscall argument points) onto our BPF program's stack. BPF code cannot dereference user pointers directly — the verifier forbids it — and the helper fails gracefully on an invalid pointer instead of faulting.
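One portability caveat for the kprobe attachment above: on x86_64 kernels since 4.17, syscalls are dispatched through arch-specific wrapper functions — the live symbol is __x64_sys_execve, and it receives a struct pt_regs * rather than the raw arguments — so a kprobe named sys_execve may fail to attach or read garbage parameters depending on the kernel. The syscall entry tracepoint has a stable ABI and avoids this entirely. A minimal sketch (handle_execve_tp is our own name; struct trace_event_raw_sys_enter comes from vmlinux.h):

// Alternative hook: a stable tracepoint instead of a symbol-dependent kprobe
SEC("tracepoint/syscalls/sys_enter_execve")
int handle_execve_tp(struct trace_event_raw_sys_enter *ctx)
{
    struct exec_event event = {};
    // args[0] of sys_enter_execve is the userspace filename pointer
    const char *filename = (const char *)ctx->args[0];
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    event.pid = bpf_get_current_pid_tgid() >> 32;
    event.ppid = BPF_CORE_READ(task, real_parent, tgid);
    event.cgroup_id = bpf_get_current_cgroup_id();
    bpf_get_current_comm(&event.comm, sizeof(event.comm));
    bpf_probe_read_user_str(&event.filename, sizeof(event.filename), filename);
    bpf_perf_event_output(ctx, &exec_events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}

On the Go side, this variant would attach with link.Tracepoint("syscalls", "sys_enter_execve", objs.HandleExecveTp, nil) instead of link.Kprobe.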
The Go Userspace Controller (`main.go` - Part 1)
Now, let's write the Go code to load this eBPF program and process the events.
We will use the cilium/ebpf library, which provides excellent Go bindings for interacting with the eBPF subsystem.
// main.go
package main
import (
"bytes"
"encoding/binary"
"errors"
"log"
"os"
"os/signal"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/perf"
"github.com/cilium/ebpf/rlimit"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -D__TARGET_ARCH_x86" bpf ./bpf/monitor.bpf.c -- -I./bpf
const (
taskCommLen = 16
maxFilenameLen = 256
)
// This mirrors the C struct
type execEvent struct {
PID uint32
PPID uint32
CgroupID uint64
Comm [taskCommLen]byte
Filename [maxFilenameLen]byte
}
func main() {
// Allow the current process to lock memory for eBPF maps.
if err := rlimit.RemoveMemlock(); err != nil {
log.Fatal(err)
}
// Load pre-compiled programs and maps into the kernel.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach the kprobe for execve
kp, err := link.Kprobe("sys_execve", objs.HandleExecve, nil)
if err != nil {
log.Fatalf("attaching kprobe: %s", err)
}
defer kp.Close()
log.Println("eBPF programs attached. Waiting for events...")
// Set up a PerfReader to read events from the perf buffer map
execRd, err := perf.NewReader(objs.ExecEvents, os.Getpagesize())
if err != nil {
log.Fatalf("creating perf event reader: %s", err)
}
defer execRd.Close()
// Set up a channel to handle OS signals
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
go handleExecEvents(execRd)
// Wait for a signal
<-stopper
log.Println("Received signal, exiting...")
}
func handleExecEvents(rd *perf.Reader) {
// This is our rudimentary security policy: an allowlist of binaries.
// In a real system, this would be dynamically configured per-container.
allowedBinaries := map[string]bool{
"/usr/bin/ls": true,
"/usr/bin/cat": true,
"/usr/bin/ps": true,
"/bin/busybox": true,
}
var event execEvent
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, perf.ErrClosed) {
return
}
log.Printf("reading from perf buffer: %s", err)
continue
}
if record.LostSamples > 0 {
log.Printf("perf buffer dropped %d samples", record.LostSamples)
continue
}
if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing perf event: %s", err)
continue
}
filename := string(bytes.TrimRight(event.Filename[:], "\x00"))
comm := string(bytes.TrimRight(event.Comm[:], "\x00"))
// Apply security policy
if !allowedBinaries[filename] {
log.Printf(
"SECURITY ALERT: Unauthorized execution detected! PID: %d, PPID: %d, Cgroup: %d, Comm: %s, Filename: %s",
event.PID,
event.PPID,
event.CgroupID,
comm,
filename,
)
} else {
log.Printf("Authorized execution: PID: %d, Filename: %s", event.PID, filename)
}
}
}
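The handler above logs the raw cgroup ID; to make an alert actionable you would resolve that ID back to a container. A hedged sketch, assuming cgroup v2 mounted at /sys/fs/cgroup, where a cgroup directory's inode number is exactly the value bpf_get_current_cgroup_id() returns (cgroupPathForID is our own helper and needs the io/fs, path/filepath, and fmt imports):

// cgroupPathForID walks the cgroup v2 hierarchy looking for the directory
// whose inode matches the kernel cgroup ID. On Kubernetes nodes the path
// typically embeds the container ID (e.g., .../cri-containerd-<id>.scope).
func cgroupPathForID(id uint64) (string, error) {
	var found string
	err := filepath.WalkDir("/sys/fs/cgroup", func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return nil
		}
		info, statErr := os.Stat(path)
		if statErr != nil {
			return nil
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok && st.Ino == id {
			found = path
			return fs.SkipAll // stop walking once we have a match
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	if found == "" {
		return "", fmt.Errorf("cgroup id %d not found", id)
	}
	return found, nil
}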
Running the Example:
1. Run make to compile both the eBPF program and the Go controller.
2. Run sudo ./monitor.
3. In another terminal, execute some commands:
* ls / (Should be logged as allowed)
* ps aux (Should be logged as allowed)
* nmap localhost (nmap is not in our allowlist, so this triggers a security alert)
# Output from ./monitor (timestamps illustrative)
2024/01/15 10:32:01 eBPF programs attached. Waiting for events...
2024/01/15 10:32:06 Authorized execution: PID: 12345, Filename: /usr/bin/ls
2024/01/15 10:32:11 SECURITY ALERT: Unauthorized execution detected! PID: 12348, PPID: 5678, Cgroup: 67231, Comm: bash, Filename: /usr/bin/nmap
Phase 2: Detecting Malicious Network Connections
Now, let's extend our monitor to detect suspicious outbound TCP connections. We'll hook into the tcp_v4_connect kernel function, extract the destination IP and port, and check it against a blocklist.
Extending the eBPF Program (`monitor.bpf.c`)
We add a new event struct and a kprobe/kretprobe pair. Extracting network information is more complex: it involves traversing nested kernel structs, and the socket's destination fields are only populated once tcp_v4_connect has done its work, so we read them at function return rather than at entry.
// Additions to bpf/monitor.bpf.c
// ... (previous code for exec_event and exec_events map) ...
// Event structure for network connections
struct net_event {
    u32 pid;
    u64 cgroup_id;
    u32 daddr; // Destination IPv4 address (network byte order)
    u16 dport; // Destination port (network byte order)
    char comm[TASK_COMM_LEN];
};
// Perf event map for network events
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} net_events SEC(".maps");
// The socket's destination fields are only filled in *during*
// tcp_v4_connect, so we stash the struct sock * at entry and read it
// once the function returns.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u64);
    __type(value, struct sock *);
} connect_socks SEC(".maps");
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(handle_tcp_connect, struct sock *sk)
{
    u64 id = bpf_get_current_pid_tgid();
    bpf_map_update_elem(&connect_socks, &id, &sk, BPF_ANY);
    return 0;
}
SEC("kretprobe/tcp_v4_connect")
int BPF_KRETPROBE(handle_tcp_connect_ret, int ret)
{
    u64 id = bpf_get_current_pid_tgid();
    struct sock **skpp = bpf_map_lookup_elem(&connect_socks, &id);
    if (!skpp)
        return 0;
    bpf_map_delete_elem(&connect_socks, &id);
    if (ret != 0) // connect failed immediately; nothing to report
        return 0;
    struct sock *sk = *skpp;
    struct net_event event = {};
    event.pid = id >> 32;
    event.cgroup_id = bpf_get_current_cgroup_id();
    bpf_get_current_comm(&event.comm, sizeof(event.comm));
    // Read destination address and port. This is the advanced part.
    // BPF_CORE_READ is essential for portability here.
    event.daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    // Both fields are in network byte order; we convert them in userspace.
    bpf_perf_event_output(ctx, &net_events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}
Advanced Implementation Details:
* Hooking tcp_v4_connect: We attach to the kernel function that implements the connect path for TCPv4; its first argument is a struct sock *. Crucially, the destination address and port are filled in during the call, not before it, so reading them at a kprobe's entry would yield zeros. We therefore stash the socket pointer at entry and emit the event from a kretprobe once the call returns successfully — the same entry/return pattern BCC's tcpconnect tool uses.
* Navigating struct sock: The socket structure is one of the most complex in the kernel. Fields like skc_daddr (destination address) are nested within __sk_common. Without CO-RE and BPF_CORE_READ, we would need to hardcode struct offsets, which would break with any kernel update. This is a classic example of why CO-RE is a production requirement for eBPF.
Extending the Go Controller (`main.go`)
We'll add a new event handler for network events.
// Additions to main.go
// ... (imports — add "net" for the IP helper below — and execEvent struct) ...
// Add the netEvent struct. Note the explicit padding field: in the C
// struct, cgroup_id is 8-byte aligned, leaving 4 padding bytes after pid,
// while binary.Read packs fields with no implicit padding.
type netEvent struct {
	PID      uint32
	_        [4]byte // alignment padding present in the C struct
	CgroupID uint64
	DAddr    uint32
	DPort    uint16
	Comm     [taskCommLen]byte
}
func main() {
// ... (setup code as before) ...
// Attach kprobe for execve (same as before)
kp_exec, err := link.Kprobe("sys_execve", objs.HandleExecve, nil)
if err != nil {
log.Fatalf("attaching execve kprobe: %s", err)
}
defer kp_exec.Close()
// Attach kprobe + kretprobe for tcp_v4_connect
kp_net, err := link.Kprobe("tcp_v4_connect", objs.HandleTcpConnect, nil)
if err != nil {
	log.Fatalf("attaching tcp_connect kprobe: %s", err)
}
defer kp_net.Close()
krp_net, err := link.Kretprobe("tcp_v4_connect", objs.HandleTcpConnectRet, nil)
if err != nil {
	log.Fatalf("attaching tcp_connect kretprobe: %s", err)
}
defer krp_net.Close()
log.Println("eBPF programs attached. Waiting for events...")
// Set up PerfReader for exec events
execRd, err := perf.NewReader(objs.ExecEvents, os.Getpagesize())
if err != nil {
log.Fatalf("creating exec perf reader: %s", err)
}
defer execRd.Close()
// Set up PerfReader for net events
netRd, err := perf.NewReader(objs.NetEvents, os.Getpagesize())
if err != nil {
log.Fatalf("creating net perf reader: %s", err)
}
defer netRd.Close()
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
go handleExecEvents(execRd) // From Part 1
go handleNetEvents(netRd) // New handler
<-stopper
log.Println("Received signal, exiting...")
}
// ... (handleExecEvents function as before) ...
func handleNetEvents(rd *perf.Reader) {
// A blocklist of known malicious IPs.
// In a real system, this would be fed from a threat intelligence source.
maliciousIPs := map[string]bool{
"1.2.3.4": true, // Example C2 server
"8.8.8.8": false, // Benign, but good for testing
}
var event netEvent
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, perf.ErrClosed) {
return
}
log.Printf("reading from net perf buffer: %s", err)
continue
}
if record.LostSamples > 0 {
log.Printf("net perf buffer dropped %d samples", record.LostSamples)
continue
}
if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing net perf event: %s", err)
continue
}
// Convert IP and port to human-readable form. Both arrive in network
// byte order; binary.Read parsed DPort as little-endian, so we swap
// the bytes back to recover the real port number.
destIP := intToIP(event.DAddr).String()
destPort := binary.BigEndian.Uint16([]byte{byte(event.DPort), byte(event.DPort >> 8)})
comm := string(bytes.TrimRight(event.Comm[:], "\x00"))
// Apply security policy
if maliciousIPs[destIP] {
log.Printf(
"SECURITY ALERT: Malicious outbound connection detected! PID: %d, Cgroup: %d, Comm: %s, Destination: %s:%d",
event.PID,
event.CgroupID,
comm,
destIP,
destPort,
)
}
}
}
// Helper to convert a uint32 IPv4 address to net.IP (requires the "net" import)
func intToIP(ipInt uint32) net.IP {
	// skc_daddr is stored in network byte order; parsing it as a
	// little-endian uint32 put the first octet in the low byte, so a
	// little-endian write restores the on-wire byte order.
	ip := make(net.IP, 4)
	binary.LittleEndian.PutUint32(ip, ipInt)
	return ip
}
Now, when you run sudo ./monitor and execute curl 1.2.3.4 in another terminal, you'll see the security alert for the malicious connection. (The curl itself may simply hang — the outbound connection attempt is enough to fire the probes.)
Edge Cases and Production Hardening
The examples above work, but deploying them in a high-traffic production environment requires addressing several critical edge cases.
1. The High-Volume Syscall Problem (The "Thirsty Syscall")
Syscalls like execve and connect are relatively infrequent. But what if you wanted to monitor read or write for data exfiltration? These can occur millions of times per second, overwhelming the perf buffer and the userspace controller. Sending every event is not feasible.
Solution: In-Kernel Aggregation with BPF Maps
Instead of sending an event for every syscall, we can use a BPF hash map to count syscalls per process inside the kernel. We then periodically read this map from userspace.
Example: Counting openat syscalls (eBPF C code — the syscall-wrapper caveat from Phase 1 applies here too; a tracepoint on syscalls/sys_enter_openat is the portable choice)
// A map to store counts: key=pid, value=count
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, u32);
__type(value, u64);
} syscall_counts SEC(".maps");
SEC("kprobe/sys_openat")
int BPF_KPROBE(handle_openat)
{
u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 *count;
count = bpf_map_lookup_elem(&syscall_counts, &pid);
if (count) {
__sync_fetch_and_add(count, 1);
} else {
u64 init_val = 1;
bpf_map_update_elem(&syscall_counts, &pid, &init_val, BPF_ANY);
}
return 0;
}
In userspace, you would then have a goroutine that iterates over this map every few seconds, reads the counts, and resets them. This reduces data transfer from millions of events per second to a few kilobytes every polling interval.
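A minimal sketch of that polling goroutine, assuming the bpf2go-generated map handle is objs.SyscallCounts (the name is ours) and using the cilium/ebpf iterator API (requires the github.com/cilium/ebpf and time imports):

// pollSyscallCounts dumps and resets the per-PID counters every interval.
func pollSyscallCounts(counts *ebpf.Map, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		var (
			pid   uint32
			count uint64
			pids  []uint32
		)
		iter := counts.Iterate()
		for iter.Next(&pid, &count) {
			log.Printf("pid=%d openat_count=%d", pid, count)
			pids = append(pids, pid)
		}
		if err := iter.Err(); err != nil {
			log.Printf("iterating syscall_counts: %s", err)
		}
		// Delete outside the walk: mutating a BPF hash map while
		// iterating it can skip or repeat entries.
		for _, p := range pids {
			_ = counts.Delete(&p)
		}
	}
}

Started from main with go pollSyscallCounts(objs.SyscallCounts, 5*time.Second).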
2. The Verifier Gauntlet: Writing Safe eBPF Code
The eBPF verifier is a static analyzer in the kernel that ensures your eBPF program is safe to run. It checks for unbounded loops, out-of-bounds memory access, and null pointer dereferences. Writing verifier-friendly code is an art.
Common Pitfall: Unbounded String Reads
bpf_probe_read_user_str() can be rejected by the verifier if it thinks the source string could be too long. Always provide a compile-time constant for the size argument.
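For example, a fixed-size destination buffer with sizeof() gives the verifier the bound it needs (user_filename here stands in for whatever user pointer you are reading):

// Verifier-friendly: buffer and size are compile-time constants, and the
// helper's return value is checked before the data is used.
char buf[MAX_FILENAME_LEN];
long n = bpf_probe_read_user_str(buf, sizeof(buf), user_filename);
if (n <= 0)
    return 0; // read failed or empty; bail out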
Common Pitfall: Complex Loops
The verifier must be able to prove that all loops terminate. Historically this meant loops had to be unrolled with a compile-time constant bound; since kernel 5.3 the verifier also accepts bounded loops directly, and the bpf_loop() helper (kernel 5.17+) handles large iteration counts.
// Verifier will accept this
#pragma unroll
for (int i = 0; i < 10; i++) {
// ...
}
3. Deployment in Kubernetes
To monitor all containers on a cluster, this agent must run on every node. The standard pattern is to deploy it as a DaemonSet.
daemonset.yaml Snippet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: ebpf-security-monitor
spec:
# ...
template:
spec:
hostPID: true # Required to see PIDs outside the container's namespace
containers:
- name: monitor
image: my-registry/ebpf-monitor:latest
securityContext:
privileged: true # Simplest way, but risky.
# Better: use specific capabilities
# capabilities:
# add: ["SYS_ADMIN", "BPF"]
volumeMounts:
- name: bpf-fs
mountPath: /sys/fs/bpf
volumes:
- name: bpf-fs
hostPath:
path: /sys/fs/bpf
Key Considerations:
* Permissions: eBPF requires powerful capabilities. CAP_SYS_ADMIN and CAP_BPF are typically needed. Running as a privileged container is common but should be done with extreme caution.
* BPF FS: The agent needs access to the BPF filesystem (/sys/fs/bpf) to pin maps and programs, allowing them to persist even if the userspace agent restarts.
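If you load programs with cilium/ebpf, pinning can be requested at load time. A hedged sketch, assuming the C map definitions also declare __uint(pinning, LIBBPF_PIN_BY_NAME) and that the pin directory has already been created under the mounted BPF filesystem:

// Maps marked LIBBPF_PIN_BY_NAME are created under PinPath — or reused
// from it if a previous agent instance already pinned them there.
objs := bpfObjects{}
opts := ebpf.CollectionOptions{
	Maps: ebpf.MapOptions{PinPath: "/sys/fs/bpf/ebpf-security-monitor"},
}
if err := loadBpfObjects(&objs, &opts); err != nil {
	log.Fatalf("loading objects with pinning: %v", err)
}
defer objs.Close()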
Performance Considerations: eBPF vs. The World
The primary advantage of eBPF is its performance. Let's consider a hypothetical benchmark on a node handling 10,000 requests per second, each request triggering several execve and connect calls.
| Method | CPU Overhead (on Node) | Latency Impact (per request) | Intrusiveness | Visibility |
|---|---|---|---|---|
| eBPF Kprobes | < 1-2% | < 1µs | Very Low | Kernel-level |
| ptrace (e.g., strace) | 10-500% (catastrophic) | 100s of µs to ms | High | Syscall-level |
| Sidecar Proxy (Istio) | 5-15% | 1-5ms | Medium | Network-level |
| LD_PRELOAD | 2-10% | 10-50µs | High | Libc-level |
As you can see, eBPF provides the best visibility-to-performance ratio by a significant margin. The overhead is orders of magnitude lower than ptrace and significantly less than even highly optimized service mesh proxies, all while providing deeper insights into system behavior.
Conclusion
eBPF is not just another tool; it's a fundamental capability of the modern Linux kernel that enables a new generation of highly performant and deeply insightful security and observability tools. By hooking directly into kernel operations, we can build runtime security monitors that are more efficient, more comprehensive, and less intrusive than any preceding technology.
We have demonstrated a practical, CO-RE based implementation for detecting unauthorized process execution and malicious network activity. We also explored critical production concerns like handling high-volume events, navigating the eBPF verifier, and deploying within a Kubernetes cluster. While we've only scratched the surface, these patterns form the foundation of powerful, next-generation cloud-native security solutions.