eBPF for Zero-Overhead APM in Kubernetes Pods

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Performance Tax of Traditional APM

For senior engineers building and maintaining distributed systems in Kubernetes, Application Performance Monitoring (APM) is a non-negotiable requirement. However, existing solutions impose a significant performance tax. Language-specific agents (e.g., Java agents, Python decorators) perform bytecode manipulation or monkey-patching, adding CPU overhead and memory pressure inside the application process itself. Service mesh sidecars, such as those deployed by Istio or Linkerd, are powerful but introduce an extra network hop for every request, adding latency and consuming substantial resources per pod.

These methods instrument the application from the inside or intercept its traffic from the outside. Both approaches create a trade-off between observability depth and application performance. In high-throughput, low-latency services, this trade-off is often unacceptable.

eBPF (extended Berkeley Packet Filter) offers a revolutionary alternative. By running sandboxed programs within the Linux kernel, eBPF can attach to nearly any event—syscalls, function entries/exits, network packets—and collect data with negligible overhead. It allows us to build a powerful APM solution that instruments the application from the outside-in, observing its behavior from the kernel level without modifying application code or its runtime environment. This article will demonstrate how to implement such a system, focusing on the practical challenges and advanced techniques required for production environments.


The Core Challenge: Tracing User-Space Applications from the Kernel

The fundamental problem we must solve is bridging the gap between the kernel and user-space. How can an eBPF program, running in the privileged context of the kernel, gain insight into a specific function call within a Go microservice running as a regular process inside a container?

The answer lies in a combination of kernel and user-space probes.

  • kprobes (Kernel Probes): These are the most common type of eBPF attachment point. They can be attached to the entry (kprobe) and exit (kretprobe) of almost any function within the kernel. For APM, they are invaluable for tracing I/O-related syscalls such as read, write, sendto, and recvfrom, giving us a precise view of network and disk activity (a minimal sketch follows this list).
  • uprobes (User-space Probes): This is where the magic happens for application-specific tracing. uprobes can be attached to the entry (uprobe) and exit (uretprobe) of functions in any user-space application. This allows us to target a specific HTTP handler in a Go binary, a request processing method in a Java application, or a database query function in a Python service.
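
    As a quick warm-up before the full example, here is a minimal BCC sketch that attaches a kprobe to the read syscall and prints a trace line for each invocation. The syscall choice and output format are illustrative only.

    python
    #!/usr/bin/python3
    # Minimal BCC sketch: log every read() syscall entry via a kprobe.
    # Requires root privileges and the BCC Python bindings.
    from bcc import BPF

    bpf_text = """
    #include <uapi/linux/ptrace.h>

    int trace_read_entry(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        bpf_trace_printk("read() entered by PID %d\\n", pid);
        return 0;
    }
    """

    b = BPF(text=bpf_text)
    # get_syscall_fnname resolves the architecture-specific symbol, e.g. __x64_sys_read
    b.attach_kprobe(event=b.get_syscall_fnname("read"), fn_name="trace_read_entry")
    print("Tracing read() syscalls... Ctrl+C to stop.")
    b.trace_print()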

    The Symbol Resolution Hurdle with Statically-Linked Binaries

    Attaching a uprobe requires knowing the memory address of the target function within the process's virtual address space. For dynamically-linked C/C++ applications, this is relatively straightforward. The dynamic linker resolves symbols, and tools can easily find the offset of a function within a shared library.

    Go, however, produces statically-linked binaries by default. All necessary library code is compiled directly into the final executable. This presents a challenge: the function's address isn't in a separate .so file but is part of the main binary. To attach a uprobe, our APM tool must be able to parse the Go binary's symbol table to find the offset for a function like main.handleGetUser.

    This is a non-trivial task that production eBPF-based APM tools like Pixie or Cilium's Hubble have solved robustly. For our purposes, we'll use tools from the BCC (BPF Compiler Collection) suite, which can handle this symbol resolution for us, provided the binary is not fully stripped of its symbol table.
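
    To make the symbol-resolution step concrete, the sketch below lists the main.* function symbols and their addresses from an unstripped Go binary. It assumes the third-party pyelftools package is installed; BCC performs an equivalent lookup internally when you attach a uprobe by symbol name.

    python
    #!/usr/bin/python3
    # Sketch: enumerate function symbols in an unstripped Go binary.
    # Assumes `pip install pyelftools`; usage: ./list_symbols.py ./server
    import sys
    from elftools.elf.elffile import ELFFile

    with open(sys.argv[1], "rb") as f:
        elf = ELFFile(f)
        symtab = elf.get_section_by_name(".symtab")
        if symtab is None:
            sys.exit("No .symtab section found: the binary appears to be stripped.")
        for sym in symtab.iter_symbols():
            # STT_FUNC entries are the function symbols a uprobe can target.
            if sym["st_info"]["type"] == "STT_FUNC" and sym.name.startswith("main."):
                print(f"{sym.name:<40} 0x{sym['st_value']:x}")

    Running this against the demo server should show main.handleGetUser in the output; if it does not, verify that the binary was built without -ldflags="-s -w".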


    Practical Implementation: Tracing a Go Microservice's HTTP Handlers

    Let's build a concrete example. We will monitor a simple Go microservice, capture the latency of a specific HTTP handler, and extract the request path—all using an eBPF program without touching the Go code.

    Step 1: The Target Go Microservice

    Here is our simple web server. It has two endpoints: /health for basic checks and /api/user/{id}, which simulates work with a randomized sleep.

    main.go

    go
    package main
    
    import (
    	"fmt"
    	"log"
    	"net/http"
    	"strconv"
    	"time"
    
    	"github.com/gorilla/mux"
    )
    
    func handleGetUser(w http.ResponseWriter, r *http.Request) {
    	// Simulate work, e.g., a database query
    	vars := mux.Vars(r)
    	idStr := vars["id"]
    	id, err := strconv.Atoi(idStr)
    	if err != nil {
    		w.WriteHeader(http.StatusBadRequest)
    		fmt.Fprintf(w, "Invalid user ID")
    		return
    	}
    
    	// Simulate a random delay between 50ms and 250ms
    	delay := time.Duration(50+id%200) * time.Millisecond
    	time.Sleep(delay)
    
    	w.WriteHeader(http.StatusOK)
    	fmt.Fprintf(w, "User data for ID %d after %s delay", id, delay)
    }
    
    func handleHealthCheck(w http.ResponseWriter, r *http.Request) {
    	w.WriteHeader(http.StatusOK)
    	fmt.Fprintf(w, "OK")
    }
    
    func main() {
    	r := mux.NewRouter()
    	r.HandleFunc("/api/user/{id:[0-9]+}", handleGetUser).Methods("GET")
    	r.HandleFunc("/health", handleHealthCheck).Methods("GET")
    
    	log.Println("Starting server on :8080")
    	if err := http.ListenAndServe(":8080", r); err != nil {
    		log.Fatalf("Could not start server: %s\n", err)
    	}
    }

    To build this for our Kubernetes environment, we use a Dockerfile. Critically, we are not stripping the binary (-s -w linker flags are omitted) so that our eBPF tool can find the function symbols.

    Dockerfile

    dockerfile
    FROM golang:1.21-alpine
    
    WORKDIR /app
    
    COPY go.mod go.sum ./
    RUN go mod download
    
    COPY *.go ./
    
    # Build the Go app. Do NOT use -ldflags="-s -w" which strips symbols.
    RUN go build -o /app/server .
    
    EXPOSE 8080
    
    CMD ["/app/server"]

    Build and push this image to a registry accessible by your Kubernetes cluster.

    Step 2: The eBPF Tracing Program (using BCC)

    We will now write a Python script using the BCC framework to load and manage our eBPF program. This script will perform the following actions:

  • Define an eBPF program in C.
  • Attach a uprobe to the entry of main.handleGetUser.
  • Attach a uretprobe to the exit of the same function.
  • At entry, store the current timestamp in a BPF map, keyed by the thread ID.
  • At exit, retrieve the start timestamp, calculate the duration, and send the result to user-space via a perf buffer.

    trace_http.py

    python
    #!/usr/bin/python3
    
    from bcc import BPF
    import argparse
    import ctypes as ct
    
    # Argument parsing
    parser = argparse.ArgumentParser(
        description="Trace Go HTTP server requests",
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("-p", "--pid", type=int, help="Process ID to trace")
    parser.add_argument("-b", "--binary", type=str, help="Path to the Go binary")
    args = parser.parse_args()
    
    if not args.pid or not args.binary:
        print("PID and binary path are required.")
        exit(1)
    
    # eBPF C program
    bpf_text = """
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>
    
    // Data structure to be sent to user-space
    struct data_t {
        u64 pid_tgid;
        u64 duration_ns;
        char comm[TASK_COMM_LEN];
    };
    
    BPF_HASH(start, u64);
    BPF_PERF_OUTPUT(events);
    
    // uprobe on function entry
    int trace_http_handler_entry(struct pt_regs *ctx) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&pid_tgid, &ts);
        return 0;
    }
    
    // uretprobe on function exit
    int trace_http_handler_exit(struct pt_regs *ctx) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&pid_tgid);
    
        if (tsp == 0) {
            // Could not find start time, probably missed the entry probe
            return 0;
        }
    
        struct data_t data = {};
        data.pid_tgid = pid_tgid;
        data.duration_ns = bpf_ktime_get_ns() - *tsp;
        bpf_get_current_comm(&data.comm, sizeof(data.comm));
    
        // Send data to user-space via perf buffer
        events.perf_submit(ctx, &data, sizeof(data));
    
        start.delete(&pid_tgid);
        return 0;
    }
    """
    
    # Load the eBPF program
    b = BPF(text=bpf_text)
    
    # Attach uprobes
    function_name = "main.handleGetUser"
    b.attach_uprobe(name=args.binary, sym=function_name, fn_name="trace_http_handler_entry", pid=args.pid)
    b.attach_uretprobe(name=args.binary, sym=function_name, fn_name="trace_http_handler_exit", pid=args.pid)
    
    print(f"Tracing {function_name} in PID {args.pid}... Press Ctrl+C to end.")
    
    # Define the data structure in Python for parsing
    class Data(ct.Structure):
        _fields_ = [("pid_tgid", ct.c_ulonglong),
                    ("duration_ns", ct.c_ulonglong),
                    ("comm", ct.c_char * 16)] # TASK_COMM_LEN
    
    def print_event(cpu, data, size):
        event = ct.cast(data, ct.POINTER(Data)).contents
        duration_ms = event.duration_ns / 1_000_000.0
        print(f"[PID: {event.pid_tgid >> 32}] Function '{function_name}' took {duration_ms:.3f} ms")
    
    # Open perf buffer and start polling
    b["events"].open_perf_buffer(print_event)
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            exit()

    Step 3: Deployment in Kubernetes

    To run this tracer in Kubernetes, we need a pod that has sufficient privileges to load eBPF programs and can access the host's process space. A DaemonSet is the ideal controller for this, ensuring our tracer runs on every node.

    This DaemonSet pod will need:

  • hostPID: true: To see processes from other pods on the same node.
  • securityContext: { privileged: true }: Privileged mode grants the capabilities (including CAP_BPF and CAP_SYS_ADMIN) needed to load eBPF programs and access kernel tracing facilities.
  • Volume mounts for the host's kernel modules (/lib/modules) and debugfs (/sys/kernel/debug).

    Here is a manifest for our Go application deployment and the tracer DaemonSet.

    deployment.yaml

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: go-server-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: go-server
      template:
        metadata:
          labels:
            app: go-server
        spec:
          containers:
          - name: server
            image: your-registry/go-ebpf-demo:latest # <-- REPLACE WITH YOUR IMAGE
            ports:
            - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: go-server-service
    spec:
      selector:
        app: go-server
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080
      type: LoadBalancer
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: http-tracer-ds
      labels:
        app: http-tracer
    spec:
      selector:
        matchLabels:
          app: http-tracer
      template:
        metadata:
          labels:
            app: http-tracer
        spec:
          hostPID: true
          tolerations:
          - operator: Exists
          containers:
          - name: tracer
            image: quay.io/iovisor/bcc:latest # A pre-built image with BCC tools
            securityContext:
              privileged: true
            # This command is a hack to find the PID and binary path and run our script.
            # In a production system, this would be a more robust discovery agent.
            command: ["/bin/bash", "-c"]
            args:
            - |
              apt-get update && apt-get install -y procps
              echo "Finding Go server process..."
              TARGET_PID=$(pgrep -n server)
              if [ -z "$TARGET_PID" ]; then
                echo "Go server process not found. Exiting."
                exit 1
              fi
              echo "Found PID: $TARGET_PID"
              BINARY_PATH=/proc/$TARGET_PID/exe
              echo "Binary path: $BINARY_PATH"
              # Run the tracing script mounted from the ConfigMap
              python3 /usr/share/bcc/tools/trace_http.py --pid $TARGET_PID --binary $BINARY_PATH
            volumeMounts:
            - name: bcc-script
              mountPath: /usr/share/bcc/tools/trace_http.py
              subPath: trace_http.py
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: sys-kernel-debug
              mountPath: /sys/kernel/debug
          volumes:
          - name: bcc-script
            configMap:
              name: tracer-script-cm
          - name: lib-modules
            hostPath:
              path: /lib/modules
          - name: sys-kernel-debug
            hostPath:
              path: /sys/kernel/debug
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tracer-script-cm
    data:
      trace_http.py: |-
        # Paste the full content of trace_http.py here
        # ... (omitted for brevity)

    After applying this YAML (and creating the ConfigMap with the Python script), the DaemonSet will start a pod on the same node as your Go application. The startup command will find the PID of the server process and its binary path, then execute our tracing script.

    Now, if you send requests to the go-server-service and check the logs of the http-tracer-ds pod, you will see the latency measurements:

    bash
    # Send some traffic
    $ curl http://<LOAD_BALANCER_IP>/api/user/10
    $ curl http://<LOAD_BALANCER_IP>/api/user/150
    
    # Check tracer logs
    $ kubectl logs -f ds/http-tracer-ds
    ...
    Finding Go server process...
    Found PID: 12345
    Binary path: /proc/12345/exe
    Tracing main.handleGetUser in PID 12345... Press Ctrl+C to end.
    [PID: 12345] Function 'main.handleGetUser' took 60.123 ms
    [PID: 12345] Function 'main.handleGetUser' took 200.456 ms

    We have successfully implemented application-level tracing with zero changes to the application code.


    Advanced Scenario: Correlating Application and Network Traces

    Capturing function latency is powerful, but in a real-world scenario, we need more context. For example, was the latency in the function itself, or was it due to slow network I/O when writing the response? To answer this, we need to correlate our uprobe events with kprobe events on network syscalls.

    The challenge is state management. How does the kprobe on tcp_sendmsg know that it's being called in the context of the main.handleGetUser function?

    The answer is to use the thread ID (TGID/PID) as a correlation key. The execution flow is as follows:

  • The uprobe on main.handleGetUser fires. We store a marker for the current thread in a BPF map (active_requests), keyed by the thread ID.
  • The kprobe on tcp_sendmsg fires. We check whether active_requests contains an entry for the current thread. If it does, we know this network I/O is happening on behalf of our traced function, and we record data about the syscall (e.g., bytes sent).
  • The uretprobe on main.handleGetUser fires. We clean up by deleting the thread's entry: active_requests.delete(thread_id).

    Let's enhance our BCC script to capture this correlation.

    trace_advanced.py (partial update)

    eBPF C Code:

    c
    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>
    #include <net/sock.h>
    
    // Map to track active threads inside our target function
    BPF_HASH(active_requests, u64);
    
    // Data for function latency
    struct http_data_t {
        u64 pid_tgid;
        u64 duration_ns;
    };
    BPF_HASH(start, u64);
    BPF_PERF_OUTPUT(http_events);
    
    // Data for network events
    struct net_data_t {
        u64 pid_tgid;
        u64 size;
    };
    BPF_PERF_OUTPUT(net_events);
    
    int trace_http_handler_entry(struct pt_regs *ctx) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&pid_tgid, &ts);
        active_requests.update(&pid_tgid, &ts); // Mark this thread as active
        return 0;
    }
    
    int trace_http_handler_exit(struct pt_regs *ctx) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&pid_tgid);
        if (tsp == 0) { return 0; }
    
        struct http_data_t data = {};
        data.pid_tgid = pid_tgid;
        data.duration_ns = bpf_ktime_get_ns() - *tsp;
        http_events.perf_submit(ctx, &data, sizeof(data));
    
        start.delete(&pid_tgid);
        active_requests.delete(&pid_tgid); // Unmark thread
        return 0;
    }
    
    // kprobe on tcp_sendmsg
    int trace_tcp_send(struct pt_regs *ctx, struct sock *sk, struct msghdr *msg, size_t size) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
    
        // Is this thread in our target function? 
        if (active_requests.lookup(&pid_tgid) == 0) {
            return 0; // Not a thread we care about
        }
    
        struct net_data_t data = {};
        data.pid_tgid = pid_tgid;
        data.size = size;
        net_events.perf_submit(ctx, &data, sizeof(data));
    
        return 0;
    }

    Python Script Updates:

    In the Python script, we would attach this new kprobe and set up a separate perf buffer and callback function to handle the network events.

    python
    # ... (previous BPF text loaded)
    
    b = BPF(text=bpf_text)
    
    # Attach uprobes as before
    b.attach_uprobe(name=args.binary, sym="main.handleGetUser", fn_name="trace_http_handler_entry", pid=args.pid)
    b.attach_uretprobe(name=args.binary, sym="main.handleGetUser", fn_name="trace_http_handler_exit", pid=args.pid)
    
    # Attach kprobe for network tracing
    b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_send")
    
    # Define ctypes structures matching the two event structs in the C code
    class HttpData(ct.Structure):
        _fields_ = [("pid_tgid", ct.c_ulonglong),
                    ("duration_ns", ct.c_ulonglong)]
    
    class NetData(ct.Structure):
        _fields_ = [("pid_tgid", ct.c_ulonglong),
                    ("size", ct.c_ulonglong)]
    
    def print_http_event(cpu, data, size):
        event = ct.cast(data, ct.POINTER(HttpData)).contents
        duration_ms = event.duration_ns / 1_000_000.0
        print(f"[PID: {event.pid_tgid >> 32}] Function 'main.handleGetUser' took {duration_ms:.3f} ms")
    
    def print_net_event(cpu, data, size):
        event = ct.cast(data, ct.POINTER(NetData)).contents
        print(f"  [PID: {event.pid_tgid >> 32}] >> TCP send of {event.size} bytes during request")
    
    # Open both perf buffers
    b["http_events"].open_perf_buffer(print_http_event)
    b["net_events"].open_perf_buffer(print_net_event)
    
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            exit()

    With this change, the tracer's output would be enriched, showing network activity directly correlated with the application function call:

    text
    [PID: 12345] >> TCP send of 85 bytes during request
    [PID: 12345] >> TCP send of 42 bytes during request
    [PID: 12345] Function 'main.handleGetUser' took 200.456 ms

    This correlated data is the foundation of advanced APM, allowing engineers to precisely attribute latency to either application logic or I/O operations.


    Edge Cases and Production Considerations

    While powerful, this approach has complexities that must be addressed in a production system.

  • Stripped Binaries and Symbol Resolution: If a Go binary is compiled with -ldflags="-s -w", the symbol table is stripped and attach_uprobe by symbol name will fail. Production systems must handle this. One advanced technique is to use DWARF debug information if available. Another is to rely on User-level Statically Defined Tracing (USDT) probes: explicit markers compiled into the application that eBPF can hook into. USDT probes are more stable across builds than function offsets, but they require developers to add them to the code (a hedged sketch follows this list).
  • High-Frequency Events and Perf Buffer Overhead: For an endpoint hit thousands of times per second, sending an event to user-space for every single request via BPF_PERF_OUTPUT becomes costly; the per-event data copy and user-space wakeups can start to impact performance. The modern alternative is the BPF ring buffer (BPF_MAP_TYPE_RINGBUF), a more efficient, lock-free, multi-producer/single-consumer buffer available in kernels 5.8 and newer. For even higher-frequency events, in-kernel aggregation is preferred: use a BPF map to build latency histograms directly in the kernel and only send the aggregated data to user-space periodically (a sketch of this approach also follows this list).
  • Kernel Version Dependencies and CO-RE: eBPF is a rapidly evolving technology. The availability of helper functions, map types, and program types is tied to the kernel version. A BPF program compiled on kernel 5.15 may not run on 4.19 because of changes to kernel data structures. This is where BCC shows its weakness; it compiles the BPF code on the target machine, requiring kernel headers to be installed. The production-grade solution is CO-RE (Compile Once - Run Everywhere), used by the libbpf library. CO-RE uses BTF (BPF Type Format) to understand the layout of kernel structures on the target host at runtime, allowing a single, pre-compiled BPF program to adapt itself and run across a wide range of kernel versions without needing headers.
  • Argument Scraping Complexity: In our example, we only captured latency. A full APM solution would need to capture the HTTP method, path, and status code. This involves reading function arguments from registers and the stack. This is highly architecture- and language-specific. For example, reading the *http.Request argument in our Go function requires understanding Go's calling convention (which registers hold which arguments) and then carefully using bpf_probe_read_user() to dereference pointers and read the string data from user-space memory. This is complex and fragile.
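
    To illustrate the USDT option, here is a hedged sketch of attaching BCC to a statically defined probe. It assumes the target process exposes a probe named request__start (for example, added via a libstapsdt-based library); the probe name and its argument are hypothetical and not part of the demo application.

    python
    #!/usr/bin/python3
    # Sketch: attach to a hypothetical USDT probe "request__start" in a target PID.
    # Assumes the application actually defines this probe; adjust names to match.
    import sys
    from bcc import BPF, USDT

    bpf_text = """
    #include <uapi/linux/ptrace.h>

    int trace_request_start(struct pt_regs *ctx) {
        u64 arg0 = 0;
        // Read the first USDT argument (e.g. a request id) recorded at the probe site.
        bpf_usdt_readarg(1, ctx, &arg0);
        bpf_trace_printk("request start, arg0=%llu\\n", arg0);
        return 0;
    }
    """

    pid = int(sys.argv[1])
    usdt = USDT(pid=pid)
    usdt.enable_probe(probe="request__start", fn_name="trace_request_start")

    b = BPF(text=bpf_text, usdt_contexts=[usdt])
    print(f"Tracing USDT probe request__start in PID {pid}... Ctrl+C to stop.")
    b.trace_print()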
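
    And to illustrate in-kernel aggregation, the sketch below replaces the per-request perf events from the earlier tracer with a log2 latency histogram held in a BPF map; user-space only reads the aggregated buckets when the tracer exits. The function and binary names follow the earlier example.

    python
    #!/usr/bin/python3
    # Sketch: aggregate main.handleGetUser latency into an in-kernel histogram
    # instead of emitting one perf event per request. Usage: ./hist.py <binary> <pid>
    import sys
    import time
    from bcc import BPF

    bpf_text = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u64);
    BPF_HISTOGRAM(latency_us);   // log2 buckets, aggregated inside the kernel

    int trace_entry(struct pt_regs *ctx) {
        u64 id = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&id, &ts);
        return 0;
    }

    int trace_exit(struct pt_regs *ctx) {
        u64 id = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&id);
        if (tsp == 0)
            return 0;
        u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
        latency_us.increment(bpf_log2l(delta_us));   // no per-event copy to user-space
        start.delete(&id);
        return 0;
    }
    """

    binary, pid = sys.argv[1], int(sys.argv[2])
    b = BPF(text=bpf_text)
    b.attach_uprobe(name=binary, sym="main.handleGetUser", fn_name="trace_entry", pid=pid)
    b.attach_uretprobe(name=binary, sym="main.handleGetUser", fn_name="trace_exit", pid=pid)

    print("Aggregating latencies in-kernel... Ctrl+C to print the histogram.")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        b["latency_us"].print_log2_hist("usecs")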

    Conclusion: The Future of Observability is in the Kernel

    Leveraging eBPF for APM represents a paradigm shift in observability. By moving instrumentation out of the application and into the kernel, we can achieve a level of detail previously impossible without incurring a significant performance penalty. This approach decouples observability from application development, allowing platform and SRE teams to deploy rich, consistent monitoring across a diverse fleet of microservices without requiring buy-in or code changes from dozens of development teams.

    While the examples here used BCC for its simplicity, building a robust, production-ready system requires embracing libbpf and CO-RE, handling the complexities of symbol resolution, and implementing efficient in-kernel data aggregation. Projects like Cilium/Tetragon, Pixie, and Calico are already demonstrating the power of this approach at scale. For senior engineers tasked with building the next generation of observability tooling, mastering eBPF is no longer optional—it is the key to unlocking truly zero-overhead, deeply insightful performance monitoring in the cloud-native era.
