eBPF for Zero-Overhead APM in Kubernetes Pods
The Performance Tax of Traditional APM
For senior engineers building and maintaining distributed systems on Kubernetes, Application Performance Monitoring (APM) is a non-negotiable requirement. However, existing solutions impose a significant performance tax. Language-specific agents (e.g., Java agents, Python decorators) perform bytecode manipulation or monkey-patching, adding CPU overhead and memory pressure inside the application process itself. The sidecar proxies used by service meshes like Istio and Linkerd, while powerful, introduce an extra network hop for every request, adding latency and consuming substantial resources per pod.
These methods instrument the application from the inside or intercept its traffic from the outside. Both approaches create a trade-off between observability depth and application performance. In high-throughput, low-latency services, this trade-off is often unacceptable.
eBPF (extended Berkeley Packet Filter) offers a revolutionary alternative. By running sandboxed programs within the Linux kernel, eBPF can attach to nearly any event—syscalls, function entries/exits, network packets—and collect data with negligible overhead. It allows us to build a powerful APM solution that instruments the application from the outside-in, observing its behavior from the kernel level without modifying application code or its runtime environment. This article will demonstrate how to implement such a system, focusing on the practical challenges and advanced techniques required for production environments.
The Core Challenge: Tracing User-Space Applications from the Kernel
The fundamental problem we must solve is bridging the gap between the kernel and user-space. How can an eBPF program, running in the privileged context of the kernel, gain insight into a specific function call within a Go microservice running as a regular process inside a container?
The answer lies in a combination of kernel and user-space probes.
- Kprobes can be attached to the entry (kprobe) and exit (kretprobe) of almost any function within the kernel. For APM, they are invaluable for tracing syscalls related to I/O, such as read, write, sendto, and recvfrom, giving us a precise view of network and disk activity.
- Uprobes can be attached to the entry (uprobe) and exit (uretprobe) of functions in any user-space application. This allows us to target a specific HTTP handler in a Go binary, a request processing method in a Java application, or a database query function in a Python service.
The Symbol Resolution Hurdle with Statically-Linked Binaries
Attaching a uprobe requires knowing the memory address of the target function within the process's virtual address space. For dynamically-linked C/C++ applications, this is relatively straightforward. The dynamic linker resolves symbols, and tools can easily find the offset of a function within a shared library.
Go, however, produces statically-linked binaries by default. All necessary library code is compiled directly into the final executable. This presents a challenge: the function's address isn't in a separate .so file but is part of the main binary. To attach a uprobe, our APM tool must be able to parse the Go binary's symbol table to find the offset for a function like main.handleGetUser.
This is a non-trivial task that production eBPF-based APM tools like Pixie or Cilium's Hubble have solved robustly. For our purposes, we'll use tools from the BCC (BPF Compiler Collection) suite, which can handle this symbol resolution for us, provided the binary is not fully stripped of its symbol table.
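As a sanity check before attaching uprobes, you can confirm that the target symbol survived compilation. The following is a small, illustrative script (an assumption of this article, not part of BCC) that uses the pyelftools library to look up main.handleGetUser in the binary's symbol table; the file name and default binary path are placeholders.
check_symbol.py
#!/usr/bin/python3
# Minimal sketch: verify that a symbol exists in a Go binary's .symtab.
# Assumes pyelftools is installed (pip install pyelftools); the binary path
# and symbol name are placeholders matching this article's example.
import sys
from elftools.elf.elffile import ELFFile

def find_symbol(binary_path, symbol_name):
    """Return the symbol's address if present in .symtab, else None."""
    with open(binary_path, "rb") as f:
        elf = ELFFile(f)
        symtab = elf.get_section_by_name(".symtab")
        if symtab is None:
            return None  # binary was stripped; attaching a uprobe by name will fail
        for sym in symtab.iter_symbols():
            if sym.name == symbol_name:
                return sym["st_value"]
    return None

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "./server"
    addr = find_symbol(path, "main.handleGetUser")
    if addr is None:
        print("Symbol not found - binary may have been stripped (-ldflags='-s -w').")
    else:
        print(f"main.handleGetUser found at 0x{addr:x}")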
Practical Implementation: Tracing a Go Microservice's HTTP Handlers
Let's build a concrete example. We will monitor a simple Go microservice, capture the latency of a specific HTTP handler, and extract the request path—all using an eBPF program without touching the Go code.
Step 1: The Target Go Microservice
Here is our simple web server. It has two endpoints: /health for basic checks and /api/user/{id}, which simulates work with a sleep.
main.go
package main
import (
	"fmt"
	"log"
	"net/http"
	"strconv"
	"time"
	"github.com/gorilla/mux"
)
func handleGetUser(w http.ResponseWriter, r *http.Request) {
	// Extract the user ID from the route variables
	vars := mux.Vars(r)
	idStr := vars["id"]
	id, err := strconv.Atoi(idStr)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		fmt.Fprintf(w, "Invalid user ID")
		return
	}
	// Simulate work (e.g., a database query) with a deterministic delay between 50ms and 249ms, derived from the ID
	delay := time.Duration(50+id%200) * time.Millisecond
	time.Sleep(delay)
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "User data for ID %d after %s delay", id, delay)
}
func handleHealthCheck(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "OK")
}
func main() {
	r := mux.NewRouter()
	r.HandleFunc("/api/user/{id:[0-9]+}", handleGetUser).Methods("GET")
	r.HandleFunc("/health", handleHealthCheck).Methods("GET")
	log.Println("Starting server on :8080")
	if err := http.ListenAndServe(":8080", r); err != nil {
		log.Fatalf("Could not start server: %s\n", err)
	}
}
To build this for our Kubernetes environment, we use a Dockerfile. Critically, we do not strip the binary (the -s and -w linker flags are omitted) so that our eBPF tooling can find the function symbols.
Dockerfile
FROM golang:1.21-alpine
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY *.go ./
# Build the Go app. Do NOT use -ldflags="-s -w" which strips symbols.
RUN go build -o /app/server .
EXPOSE 8080
CMD ["/app/server"]
Build and push this image to a registry accessible by your Kubernetes cluster.
Step 2: The eBPF Tracing Program (using BCC)
We will now write a Python script using the BCC framework to load and manage our eBPF program. This script will perform the following actions:
- Define an eBPF program in C.
- Attach a uprobe to the entry of main.handleGetUser.
- Attach a uretprobe to the exit of the same function.
- At entry, store the current timestamp in a BPF map, keyed by the thread ID.
- At exit, retrieve the start timestamp, calculate the duration, and send the result to user-space via a perf buffer.
trace_http.py
#!/usr/bin/python3
from bcc import BPF
import argparse
import ctypes as ct
# Argument parsing
parser = argparse.ArgumentParser(
    description="Trace Go HTTP server requests",
    formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument("-p", "--pid", type=int, help="Process ID to trace")
parser.add_argument("-b", "--binary", type=str, help="Path to the Go binary")
args = parser.parse_args()
if not args.pid or not args.binary:
    print("PID and binary path are required.")
    exit(1)
# eBPF C program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
// Data structure to be sent to user-space
struct data_t {
    u64 pid_tgid;
    u64 duration_ns;
    char comm[TASK_COMM_LEN];
};
BPF_HASH(start, u64);
BPF_PERF_OUTPUT(events);
// uprobe on function entry
int trace_http_handler_entry(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid_tgid, &ts);
    return 0;
}
// uretprobe on function exit
int trace_http_handler_exit(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid_tgid);
    if (tsp == 0) {
        // Could not find start time, probably missed the entry probe
        return 0;
    }
    struct data_t data = {};
    data.pid_tgid = pid_tgid;
    data.duration_ns = bpf_ktime_get_ns() - *tsp;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    // Send data to user-space via perf buffer
    events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&pid_tgid);
    return 0;
}
"""
# Load the eBPF program
b = BPF(text=bpf_text)
# Attach uprobes
function_name = "main.handleGetUser"
b.attach_uprobe(name=args.binary, sym=function_name, fn_name="trace_http_handler_entry", pid=args.pid)
b.attach_uretprobe(name=args.binary, sym=function_name, fn_name="trace_http_handler_exit", pid=args.pid)
print(f"Tracing {function_name} in PID {args.pid}... Press Ctrl+C to end.")
# Define the data structure in Python for parsing
class Data(ct.Structure):
    _fields_ = [("pid_tgid", ct.c_ulonglong),
                ("duration_ns", ct.c_ulonglong),
                ("comm", ct.c_char * 16)] # TASK_COMM_LEN
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    duration_ms = event.duration_ns / 1_000_000.0
    print(f"[PID: {event.pid_tgid >> 32}] Function '{function_name}' took {duration_ms:.3f} ms")
# Open perf buffer and start polling
b["events"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
Step 3: Deployment in Kubernetes
To run this tracer in Kubernetes, we need a pod that has sufficient privileges to load eBPF programs and can access the host's process space. A DaemonSet is the ideal controller for this, ensuring our tracer runs on every node.
This DaemonSet pod will need:
*   hostPID: true: To see processes from other pods on the same node.
*   securityContext: { privileged: true }: To grant the capabilities (such as CAP_BPF and CAP_SYS_ADMIN) required to load eBPF programs.
*   Volume mounts for the host's /lib/modules (kernel modules/headers) and /sys/kernel/debug (the tracing interface BCC relies on).
Here is a manifest for our Go application deployment and the tracer DaemonSet.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-server-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-server
  template:
    metadata:
      labels:
        app: go-server
    spec:
      containers:
      - name: server
        image: your-registry/go-ebpf-demo:latest # <-- REPLACE WITH YOUR IMAGE
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: go-server-service
spec:
  selector:
    app: go-server
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: http-tracer-ds
  labels:
    app: http-tracer
spec:
  selector:
    matchLabels:
      app: http-tracer
  template:
    metadata:
      labels:
        app: http-tracer
    spec:
      hostPID: true
      tolerations:
      - operator: Exists
      containers:
      - name: tracer
        image: quay.io/iovisor/bcc:latest # A pre-built image with BCC tools
        securityContext:
          privileged: true
        # This command is a hack to find the PID and binary path and run our script.
        # In a production system, this would be a more robust discovery agent.
        command: ["/bin/bash", "-c"]
        args:
        - |
          apt-get update && apt-get install -y procps
          echo "Finding Go server process..."
          TARGET_PID=$(pgrep -n server)
          if [ -z "$TARGET_PID" ]; then
            echo "Go server process not found. Exiting."
            exit 1
          fi
          echo "Found PID: $TARGET_PID"
          BINARY_PATH=/proc/$TARGET_PID/exe
          echo "Binary path: $BINARY_PATH"
          # Run our tracing script (mounted from the ConfigMap) against the discovered process
          python3 /usr/share/bcc/tools/trace_http.py --pid $TARGET_PID --binary $BINARY_PATH
        volumeMounts:
        - name: bcc-script
          mountPath: /usr/share/bcc/tools/trace_http.py
          subPath: trace_http.py
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: sys-kernel-debug
          mountPath: /sys/kernel/debug
      volumes:
      - name: bcc-script
        configMap:
          name: tracer-script-cm
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: sys-kernel-debug
        hostPath:
          path: /sys/kernel/debug
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: tracer-script-cm
data:
  trace_http.py: |-
    # Paste the full content of trace_http.py here
    # ... (omitted for brevity)
After applying this YAML (with the full Python script pasted into the ConfigMap), the DaemonSet starts a tracer pod on every node, including the one running your Go application. The startup command finds the PID of the server process and its binary path, then executes our tracing script.
Now, if you send requests to the go-server-service and check the logs of the http-tracer-ds pod, you will see the latency measurements:
# Send some traffic
$ curl http://<LOAD_BALANCER_IP>/api/user/10
$ curl http://<LOAD_BALANCER_IP>/api/user/150
# Check tracer logs
$ kubectl logs -f ds/http-tracer-ds
...
Finding Go server process...
Found PID: 12345
Binary path: /proc/12345/exe
Tracing main.handleGetUser in PID 12345... Press Ctrl+C to end.
[PID: 12345] Function 'main.handleGetUser' took 60.123 ms
[PID: 12345] Function 'main.handleGetUser' took 200.456 ms
We have successfully implemented application-level tracing with zero changes to the application code.
Advanced Scenario: Correlating Application and Network Traces
Capturing function latency is powerful, but in a real-world scenario, we need more context. For example, was the latency in the function itself, or was it due to slow network I/O when writing the response? To answer this, we need to correlate our uprobe events with kprobe events on network syscalls.
The challenge is state management. How does the kprobe on tcp_sendmsg know that it's being called in the context of the main.handleGetUser function?
The answer is to use the thread ID (TGID/PID) as a correlation key. The execution flow is as follows:
1. The uprobe on main.handleGetUser fires. We store a marker in a BPF map: active_requests[thread_id] = 1.
2. The kprobe on tcp_sendmsg fires. We check whether active_requests[thread_id] exists. If it does, we know this network I/O is happening on behalf of our traced function, and we record data about the syscall (e.g., bytes sent).
3. The uretprobe on main.handleGetUser fires. We clean up the map by deleting the entry: active_requests.delete(thread_id).
Let's enhance our BCC script to capture this correlation.
trace_advanced.py (partial update)
eBPF C Code:
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <net/sock.h>
// Map to track active threads inside our target function
BPF_HASH(active_requests, u64);
// Data for function latency
struct http_data_t {
    u64 pid_tgid;
    u64 duration_ns;
};
BPF_HASH(start, u64);
BPF_PERF_OUTPUT(http_events);
// Data for network events
struct net_data_t {
    u64 pid_tgid;
    u64 size;
};
BPF_PERF_OUTPUT(net_events);
int trace_http_handler_entry(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid_tgid, &ts);
    active_requests.update(&pid_tgid, &ts); // Mark this thread as active
    return 0;
}
int trace_http_handler_exit(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid_tgid);
    if (tsp == 0) { return 0; }
    struct http_data_t data = {};
    data.pid_tgid = pid_tgid;
    data.duration_ns = bpf_ktime_get_ns() - *tsp;
    http_events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&pid_tgid);
    active_requests.delete(&pid_tgid); // Unmark thread
    return 0;
}
// kprobe on tcp_sendmsg
int trace_tcp_send(struct pt_regs *ctx, struct sock *sk, struct msghdr *msg, size_t size) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    // Is this thread in our target function? 
    if (active_requests.lookup(&pid_tgid) == 0) {
        return 0; // Not a thread we care about
    }
    struct net_data_t data = {};
    data.pid_tgid = pid_tgid;
    data.size = size;
    net_events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
Python Script Updates:
In the Python script, we would attach this new kprobe and set up a separate perf buffer and callback function to handle the network events.
# ... (previous BPF text loaded)
b = BPF(text=bpf_text)
# Attach uprobes as before
b.attach_uprobe(name=args.binary, sym="main.handleGetUser", fn_name="trace_http_handler_entry", pid=args.pid)
b.attach_uretprobe(name=args.binary, sym="main.handleGetUser", fn_name="trace_http_handler_exit", pid=args.pid)
# Attach kprobe for network tracing
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_send")
# ... (define data structures for both event types)
def print_http_event(cpu, data, size):
    pass  # ... (same logic as print_event in the first script)
def print_net_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(NetData)).contents
    print(f"  [PID: {event.pid_tgid >> 32}] >> TCP send of {event.size} bytes during request")
# Open both perf buffers
b["http_events"].open_perf_buffer(print_http_event)
b["net_events"].open_perf_buffer(print_net_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
With this change, the tracer's output would be enriched, showing network activity directly correlated with the application function call:
[PID: 12345] >> TCP send of 85 bytes during request
[PID: 12345] >> TCP send of 42 bytes during request
[PID: 12345] Function 'main.handleGetUser' took 200.456 ms
This correlated data is the foundation of advanced APM, allowing engineers to precisely attribute latency to either application logic or I/O operations.
Edge Cases and Production Considerations
While powerful, this approach has complexities that must be addressed in a production system.
- Stripped binaries: If the Go binary is built with -ldflags="-s -w", the symbol table is stripped, and attach_uprobe by symbol name will fail. Production systems must handle this. One advanced technique is to use DWARF debug information if available. Another is to rely on User-level Statically Defined Tracing (USDT) probes, which are explicit markers compiled into the application that eBPF can hook into. USDT probes are more stable across builds than function offsets but require developers to add them to the code.
- High-frequency events: At high request rates, the overhead of pushing every event to user-space through BPF_PERF_OUTPUT can become significant. The data copy and context switching can start to impact performance. The modern solution is to use BPF_RINGBUF, a more efficient, lock-free, multi-producer/single-consumer ring buffer available in newer kernels (5.8+). For even higher-frequency events, in-kernel aggregation is preferred: use a BPF map to build histograms of latencies directly in the kernel and only send the aggregated data to user-space periodically (a minimal sketch follows this list).
- Kernel portability: BCC compiles the eBPF C code on the node at load time and depends on kernel headers being present there. The production-grade alternative is CO-RE (Compile Once, Run Everywhere) built on the libbpf library. CO-RE uses BTF (BPF Type Format) to understand the layout of kernel structures on the target host at runtime, allowing a single, pre-compiled BPF program to adapt itself and run across a wide range of kernel versions without needing headers.
- Extracting function arguments: Reading the *http.Request argument in our Go function requires understanding Go's calling convention (which registers hold which arguments) and then carefully using bpf_probe_read_user() to dereference pointers and read the string data from user-space memory. This is complex and fragile.
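To make the in-kernel aggregation point concrete, here is a minimal sketch that swaps the per-event perf output for a log2 latency histogram maintained entirely inside the kernel with BCC's BPF_HISTOGRAM map; user-space only reads the aggregated distribution periodically. The script name, argument handling, and 10-second reporting interval are assumptions for illustration, not part of the earlier tracer.
latency_hist.py (sketch)
#!/usr/bin/python3
# Minimal sketch: aggregate handler latency in-kernel as a log2 histogram
# instead of emitting one perf event per request. Assumes the same
# main.handleGetUser target and BCC environment as the earlier tracer.
from bcc import BPF
import sys, time

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u64);
BPF_HISTOGRAM(latency_ms);   // log2 histogram of handler latency

int trace_entry(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid_tgid, &ts);
    return 0;
}

int trace_exit(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid_tgid);
    if (tsp == 0) { return 0; }
    u64 delta_ms = (bpf_ktime_get_ns() - *tsp) / 1000000;
    latency_ms.increment(bpf_log2l(delta_ms));  // aggregate in the kernel
    start.delete(&pid_tgid);
    return 0;
}
"""

binary, pid = sys.argv[1], int(sys.argv[2])
b = BPF(text=bpf_text)
b.attach_uprobe(name=binary, sym="main.handleGetUser", fn_name="trace_entry", pid=pid)
b.attach_uretprobe(name=binary, sym="main.handleGetUser", fn_name="trace_exit", pid=pid)

# Periodically dump the aggregated distribution instead of per-event data
while True:
    try:
        time.sleep(10)
        b["latency_ms"].print_log2_hist("latency (ms)")
        b["latency_ms"].clear()
    except KeyboardInterrupt:
        exit()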
Conclusion: The Future of Observability is in the Kernel
Leveraging eBPF for APM represents a paradigm shift in observability. By moving instrumentation out of the application and into the kernel, we can achieve a level of detail previously impossible without incurring a significant performance penalty. This approach decouples observability from application development, allowing platform and SRE teams to deploy rich, consistent monitoring across a diverse fleet of microservices without requiring buy-in or code changes from dozens of development teams.
While the examples here used BCC for its simplicity, building a robust, production-ready system requires embracing libbpf and CO-RE, handling the complexities of symbol resolution, and implementing efficient in-kernel data aggregation. Projects like Cilium/Tetragon, Pixie, and Calico are already demonstrating the power of this approach at scale. For senior engineers tasked with building the next generation of observability tooling, mastering eBPF is no longer optional—it is the key to unlocking truly zero-overhead, deeply insightful performance monitoring in the cloud-native era.