eBPF-Powered Observability: Low-Overhead Tracing in K8s with Cilium

14 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Performance Bottleneck of Sidecar Observability

In modern microservices architectures running on Kubernetes, observability is non-negotiable. However, the de facto standard—the sidecar proxy model popularized by service meshes like Istio—comes with a well-documented performance cost. Every packet originating from or destined for your application pod must traverse a user-space proxy. This round trip involves multiple context switches between user space and kernel space, memory copy operations, and the inherent processing latency of the proxy itself. For high-throughput, low-latency services, this can add milliseconds to your p99 latency and significantly increase CPU and memory footprints across the cluster.

The fundamental issue is the data path. A typical sidecar flow looks like this:

Application -> iptables redirect (pod netns) -> Sidecar Proxy (user space, inside the pod) -> Pod Network Namespace (veth) -> Host Network Namespace -> Destination

This model, while powerful for traffic management and policy enforcement, is suboptimal for pure observability. We are paying a continuous performance tax for data that is often just being observed, not mutated.

eBPF (extended Berkeley Packet Filter) offers a paradigm shift. By attaching small, sandboxed programs to various hook points within the Linux kernel, we can achieve similar visibility directly at the source, eliminating the user-space detour. Cilium leverages eBPF to create a networking, security, and observability data plane that operates almost entirely within the kernel.

This article will demonstrate how to harness this power, focusing on practical, production-level techniques for deep system tracing with minimal overhead.


Section 1: Architecting for Kernel-Level Visibility

Before diving into commands, it's crucial to understand the architectural differences. Cilium's observability tool, Hubble, uses eBPF programs attached to kernel hooks like Traffic Control (TC) and socket operations to capture network flow data.

  • L3/L4 Visibility: An eBPF program on the TC ingress/egress hooks of a network device (such as a pod's veth pair) can inspect every packet, capturing source/destination IP, port, and TCP flags. Because this happens before the packet is handed to the pod's network stack, it is incredibly efficient. (A quick way to inspect these attachments is sketched just after this list.)
  • L7 Visibility: This is where the magic lies. Instead of terminating TLS and parsing traffic in a user-space proxy, Cilium can attach eBPF programs (uprobes rather than kprobes, since the targets are user-space functions) to the read/write routines of common TLS libraries (e.g., OpenSSL, GnuTLS). This lets it inspect the plaintext just before it is encrypted or just after it is decrypted by the application's TLS library, providing L7 visibility (HTTP paths, gRPC methods, Kafka topics) without the certificate management and TLS-termination overhead of a sidecar.
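    For example, you can confirm which eBPF programs Cilium has attached to a pod's veth device directly from the node. This is a minimal sketch, assuming shell access to a node (or a host-network debug pod) where bpftool and tc are available; the interface name is a placeholder:

    bash
    # List eBPF programs attached to network devices on this node (requires bpftool)
    bpftool net show

    # Or inspect a single Cilium-managed veth interface with tc
    # (replace lxc1234abcd with a real interface name from `ip link`)
    tc filter show dev lxc1234abcd ingress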
    Production-Grade Cilium & Hubble Configuration

    We assume a running Kubernetes cluster. A default Cilium installation is insufficient; we need to enable Hubble with its UI and metrics endpoints. Below is a production-oriented Helm configuration snippet.

    yaml
    # values-production.yaml
    
    # Replace kube-proxy entirely for maximum performance.
    # Cilium then handles all Kubernetes Service routing via eBPF maps.
    kubeProxyReplacement: strict
    
    # Host-reachable services: eBPF-based handling of ClusterIP traffic
    # originating from the host network namespace
    hostServices:
      enabled: true
    
    # Enable BPF masquerading for traffic leaving the cluster
    bpf:
      masquerade: true
    
    # Hubble Configuration
    hubble:
      enabled: true
      # Deploy Hubble Relay for cluster-wide flow aggregation
      relay:
        enabled: true
        # Tune buffer sizes for high-traffic clusters
        # Default is 4095; increase if you see dropped flows
        bufferSize: 8191
      # Deploy the UI for visual inspection
      ui:
        enabled: true
      # Enable metrics for Prometheus integration
      metrics:
        enabled:
          - "dns"
          - "drop"
          - "tcp"
          - "flow"
          - "port-distribution"
          - "icmp"
          - "http"
    
    # Prometheus Integration
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true # For Prometheus Operator
    
    # Operator Configuration
    operator:
      prometheus:
        enabled: true
        serviceMonitor:
          enabled: true

    To apply this configuration:

    bash
    helm repo add cilium https://helm.cilium.io/
    
    helm install cilium cilium/cilium --version 1.12.5 \
      --namespace kube-system \
      -f values-production.yaml

    Key Production Considerations from this configuration:

  • kubeProxyReplacement: strict: This is a critical performance optimization. It removes iptables from the service routing path entirely. Cilium uses eBPF hash maps to perform NAT for Kubernetes Services, which is significantly faster and more scalable than sequential iptables rule processing.
  • hubble.relay.enabled: true: In a multi-node cluster, the Hubble daemon on each node only sees flows on that node. Hubble Relay aggregates these flows, providing a single API endpoint for cluster-wide observability. Without it, you'd have to query each node's agent individually. A quick way to verify both of these settings is sketched below.
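    With the chart installed and the agent pods running, you can check that these settings actually took effect. A minimal verification sketch (the label selectors are the chart's usual defaults and may differ slightly between versions):

    bash
    # Ask an agent for its datapath status (ds/cilium targets the DaemonSet's pods)
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Hubble'
    
    # Confirm Hubble Relay and the UI came up
    kubectl -n kube-system get pods -l k8s-app=hubble-relay
    kubectl -n kube-system get pods -l k8s-app=hubble-ui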

    Section 2: Advanced Flow Tracing with the Hubble CLI

    The Hubble CLI is your primary tool for real-time debugging. Let's move beyond hubble observe and into complex scenarios.
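    If you don't already have the hubble binary installed locally, it ships as a release artifact on GitHub. A minimal install sketch for a Linux amd64 workstation (the stable.txt version file and asset naming follow the Hubble release convention; adjust for your OS and architecture):

    bash
    # Fetch the latest stable Hubble CLI version and install it
    HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
    curl -L --fail --remote-name-all \
      "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
    tar xzvf hubble-linux-amd64.tar.gz
    sudo mv hubble /usr/local/bin/
    hubble version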

    First, let's set up a sample application. We'll use a paymentservice and a currencyservice.

    yaml
    # sample-app.yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      name: hipstershop
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: paymentservice
      namespace: hipstershop
      labels:
        app: paymentservice
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: paymentservice
      template:
        metadata:
          labels:
            app: paymentservice
        spec:
          containers:
          - name: server
            image: gcr.io/google-samples/microservices-demo/paymentservice:v0.3.8
            ports:
            - containerPort: 50051
    --- 
    apiVersion: v1
    kind: Service
    metadata:
      name: paymentservice
      namespace: hipstershop
    spec:
      type: ClusterIP
      selector:
        app: paymentservice
      ports:
      - port: 50051
        targetPort: 50051
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: currencyservice
      namespace: hipstershop
      labels:
        app: currencyservice
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: currencyservice
      template:
        metadata:
          labels:
            app: currencyservice
        spec:
          containers:
          - name: server
            image: gcr.io/google-samples/microservices-demo/currencyservice:v0.3.8
            ports:
            - containerPort: 7000
    --- 
    apiVersion: v1
    kind: Service
    metadata:
      name: currencyservice
      namespace: hipstershop
    spec:
      type: ClusterIP
      selector:
        app: currencyservice
      ports:
      - port: 7000
        targetPort: 7000

    Apply it: kubectl apply -f sample-app.yaml
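    Before tracing anything, make sure both workloads are actually ready (the deployment names and labels match the manifest above):

    bash
    # Wait for both services to come up before generating traffic
    kubectl -n hipstershop rollout status deploy/paymentservice --timeout=120s
    kubectl -n hipstershop rollout status deploy/currencyservice --timeout=120s
    kubectl -n hipstershop get pods -o wide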

    Scenario 1: Debugging Dropped Packets due to a Network Policy

    Let's create a restrictive CiliumNetworkPolicy that denies traffic.

    yaml
    # deny-policy.yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "deny-all-ingress"
      namespace: hipstershop
    spec:
      endpointSelector:
        matchLabels:
          app: paymentservice
      ingress: [] # Empty ingress means deny all

    Apply it: kubectl apply -f deny-policy.yaml

    Now, if we try to connect from currencyservice to paymentservice, it will fail. From inside the pod the connection simply times out, which tells us nothing about why. How do we debug this at the network layer?

    bash
    # Exec into the currencyservice pod
    CURRENCY_POD=$(kubectl get pods -n hipstershop -l app=currencyservice -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -it -n hipstershop $CURRENCY_POD -- /bin/sh
    
    # Inside the pod, try to connect (this will hang until the timeout)
    # If apk/curl is unavailable in the image, a `kubectl debug` ephemeral container works too
    apk add --no-cache curl
    curl -v --max-time 10 paymentservice:50051

    Now, from another terminal, use Hubble to see exactly why it's failing.

    bash
    # Port-forward the hubble-relay service
    kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &
    
    # Use hubble observe to find the drop
    hubble observe --namespace hipstershop --verdict DROPPED --to-pod paymentservice -f

    Expected Output & Analysis:

    text
    TIME                 SOURCE -> DESTINATION                                   VERDICT     REASON
    Oct 26 12:35:10.123  hipstershop/currencyservice-5f... (10.0.1.45) -> hipstershop/paymentservice-6c... (10.0.1.99:50051)   DROPPED     Policy denied

    The output is unambiguous. We see:

  • VERDICT: DROPPED
  • REASON: Policy denied

    This confirms a network policy is the culprit. The key performance insight is that this filtering and logging happened entirely in the kernel: the packet was dropped by the eBPF program on the destination pod's network interface, so it never reached the pod's network stack, let alone an iptables chain. This makes policy enforcement extremely efficient, and you can corroborate it from the node itself, as sketched below.
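    For a node-local view of the same events, the Cilium agent's monitor can stream drop notifications straight from the eBPF datapath. A minimal sketch, assuming you pick the agent running on the same node as the paymentservice pod:

    bash
    # Find the Cilium agent on the node that hosts the paymentservice pod
    NODE=$(kubectl get pod -n hipstershop -l app=paymentservice -o jsonpath='{.items[0].spec.nodeName}')
    CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium \
      --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')
    
    # Stream drop notifications directly from the datapath
    kubectl exec -it -n kube-system $CILIUM_POD -- cilium monitor --type drop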

    Scenario 2: Identity-Based vs. IP-Based Filtering

    Cilium assigns a security identity (a numeric ID) to each endpoint based on its labels. Policies are then enforced based on these identities, not ephemeral pod IPs. This is a more robust and scalable model.

    Let's find the identity of our pods:

    bash
    # Get the Cilium Endpoint for the currency service
    CILIUM_EP=$(kubectl get cep -n hipstershop -l app=currencyservice -o jsonpath='{.items[0].metadata.name}')
    
    # Describe the endpoint to get its identity
    kubectl describe cep -n hipstershop $CILIUM_EP
    
    # Look for a line like: Identity: ID=43128, Labels: [k8s:app=currencyservice, ...]

    Let's say the identity is 43128. We can now use this for highly specific tracing.

    bash
    # Trace all traffic originating from any pod with this identity
    hubble observe --from-identity 43128
    
    # This is far more powerful than IP-based filtering, as pods can be rescheduled and get new IPs,
    # but their label-based identity remains the same.
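    Identity filters compose with the other observe flags, which makes tightly scoped queries easy. A small sketch (43128 and 12945 are the example identities used in this article; substitute your own):

    bash
    # Only dropped flows from the currencyservice identity to the paymentservice identity
    hubble observe --from-identity 43128 --to-identity 12945 --verdict DROPPED --last 50
    
    # The same source identity, restricted to DNS traffic
    hubble observe --from-identity 43128 --port 53 --last 50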

    Section 3: L7 Observability without Sidecars (HTTP/gRPC)

    This is where Cilium + eBPF truly outshines traditional methods for pure observability.

    First, let's fix our network policy to allow traffic.

    yaml
    # allow-policy.yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "allow-currency-to-payment"
      namespace: hipstershop
    spec:
      endpointSelector:
        matchLabels:
          app: paymentservice
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: currencyservice
        # checkoutservice (deployed with the full demo below) also calls paymentservice
        - matchLabels:
            app: checkoutservice

    Delete the old policy and apply the new one:

    bash
    kubectl delete cnp -n hipstershop deny-all-ingress
    kubectl apply -f allow-policy.yaml
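    Before moving on, confirm that traffic is now forwarded; the earlier curl from the currencyservice pod should at least complete a TCP connection, and Hubble should report FORWARDED verdicts:

    bash
    # The allow policy should now be the only one in the namespace
    kubectl get cnp -n hipstershop
    
    # Flows to paymentservice should now be forwarded rather than dropped
    hubble observe --namespace hipstershop --to-pod paymentservice --verdict FORWARDED --last 20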

    Our sample app uses gRPC. Let's generate some traffic.

    bash
    # Deploy the rest of the microservices demo into the same namespace;
    # it includes a load generator and the other services (e.g., checkoutservice)
    kubectl apply -n hipstershop -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
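    One practical note: out of the box Hubble reports L3/L4 flows, and the agent only parses L7 payloads for endpoints whose traffic is steered through its proxy. Depending on your Cilium version and configuration, that means either an L7-aware CiliumNetworkPolicy or the proxy-visibility pod annotation. A hedged sketch of the annotation approach (the annotation format comes from Cilium's proxy-visibility feature; verify it against the docs for your version):

    bash
    # Ask Cilium to parse ingress traffic to paymentservice on 50051 as HTTP (gRPC rides on HTTP/2)
    kubectl annotate pod -n hipstershop -l app=paymentservice \
      policy.cilium.io/proxy-visibility="<Ingress/50051/TCP/HTTP>" --overwrite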

    Now, let's inspect the gRPC calls between services.

    bash
    # Observe gRPC traffic to the paymentservice
    hubble observe --namespace hipstershop --protocol grpc --to-pod paymentservice -f

    Expected Output & Analysis:

    text
    TIME                 SOURCE -> DESTINATION                                   TYPE        VERDICT
    Oct 26 12:45:20.555  hipstershop/checkoutservice-7d... -> hipstershop/paymentservice-6c... (50051)   gRPC        FORWARDED (gRPC) {call:"hipstershop.PaymentService/Charge", authority:":authority: paymentservice:50051"}

    Notice the rich L7 information: gRPC {call:"hipstershop.PaymentService/Charge"}. Cilium has parsed the gRPC request to extract the service and method name. This was achieved without a per-pod sidecar, without terminating TLS in a sidecar, and without any application code changes.

    Edge Case: Statically Compiled Go Binaries

    A common edge case for L7 TLS parsing is Go applications: Go uses its own crypto/tls package rather than the system's shared C libraries (like OpenSSL), and its binaries are typically statically linked. Probes that look up symbols in those shared libraries (uprobes, since the targets live in user space) therefore never fire, and automatic L7 parsing of encrypted traffic can fail.

    Solution: You need to provide user-space probing information so the agent knows where to find the TLS function symbols inside the statically linked binary (for Go, functions such as crypto/tls.(*Conn).Read and crypto/tls.(*Conn).Write). This is an advanced procedure and requires analyzing the binary with tools like nm to find the correct symbols, as sketched below.
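    A minimal sketch of that symbol analysis, assuming a locally built, non-stripped Go binary named my-static-go-app (both the binary name and the symbol list are illustrative):

    bash
    # List the TLS read/write symbols in a statically linked Go binary
    go tool nm ./my-static-go-app | grep -E 'crypto/tls\.\(\*Conn\)\.(Read|Write)'
    
    # Plain nm also works if the Go toolchain is not available on this machine
    nm ./my-static-go-app | grep 'crypto/tls'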

    Example annotation (conceptual):

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-static-go-app
    spec:
      template:
        metadata:
          annotations:
            # This is a conceptual example; the exact annotation may vary.
            # Note: it goes on the pod template so it lands on the pods themselves.
            "cilium.io/tls-probe.go-crypto/tls": "/path/to/binary:main.FunctionName"
      # ...

    Section 4: Integrating with Prometheus for Long-Term Analysis

    Real-time CLI observation is for debugging. For monitoring and alerting, we need to integrate with a TSDB like Prometheus.

    Our values-production.yaml already enabled the metrics endpoints and created a ServiceMonitor. If you have the Prometheus Operator installed, it will automatically scrape Cilium and Hubble.

    Let's explore some powerful PromQL queries you can build.

    1. HTTP Latency Golden Signals (without a sidecar)

    Hubble can export HTTP request/response metrics. You can calculate tail latency (here, p99) between two services.

    promql
    # p99 latency for HTTP GET requests from frontend to productcatalogservice
    
    histogram_quantile(0.99,
      sum(rate(hubble_http_response_latency_seconds_bucket{
        namespace="hipstershop",
        source_app="frontend",
        destination_app="productcatalogservice",
        method="GET"
      }[5m])) by (le, source_app, destination_app)
    )

    2. Network Policy Drop Rate by Reason

    This is invaluable for security monitoring. You can alert if a specific application suddenly starts seeing a high rate of policy denials.

    promql
    # Rate of dropped packets to the paymentservice, broken down by drop reason
    
    sum(rate(hubble_drop_total{
      namespace="hipstershop",
      destination_app="paymentservice"
    }[5m])) by (reason)

    3. DNS Resolution Failures per Source App

    Cilium can also parse DNS requests/responses, giving you insight into service discovery issues.

    promql
    # Rate of DNS queries with RCODE != NoError, indicating an error
    
    sum(rate(hubble_dns_responses_total{
      namespace="hipstershop",
      rcode!="NoError"
    }[5m])) by (source_app)

    These metrics provide a comprehensive view of your network's health and security posture, all sourced directly from the kernel with minimal overhead.
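    If these queries come back empty, it helps to confirm that the Hubble metrics endpoint is serving data before digging into the Prometheus scrape configuration. A quick sanity check (9965 is the chart's usual default for the Hubble metrics port; verify it against your values):

    bash
    # Pick one Cilium agent and forward its Hubble metrics port
    CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
    kubectl -n kube-system port-forward $CILIUM_POD 9965:9965 &
    
    # The hubble_* metric families enabled in values-production.yaml should appear here
    curl -s localhost:9965/metrics | grep -E '^hubble_(drop|dns|http)' | head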


    Section 5: Advanced Troubleshooting with `cilium policy trace`

    Sometimes, hubble observe tells you a packet was dropped, but your network policies are complex, and you don't know which rule is the cause. The cilium policy trace command is a simulator that lets you determine the policy verdict for a hypothetical packet.

    Let's go back to our deny-all-ingress scenario (re-apply deny-policy.yaml if you deleted it in Section 3). We know traffic from currencyservice to paymentservice is being dropped. Let's prove it with the tracer.

    First, we need the security identities of the source and destination workloads.

    bash
    # Get the source identity (currencyservice)
    SOURCE_IDENTITY=$(kubectl get cep -n hipstershop -l app=currencyservice -o jsonpath='{.items[0].status.identity.id}')
    
    # Get the destination pod name (used below to locate its node) and identity (paymentservice)
    DEST_POD_NAME=$(kubectl get pod -n hipstershop -l app=paymentservice -o jsonpath='{.items[0].metadata.name}')
    DEST_IDENTITY=$(kubectl get cep -n hipstershop -l app=paymentservice -o jsonpath='{.items[0].status.identity.id}')

    Now, run the trace from one of the Cilium agent pods. Find a cilium pod on the same node as the destination pod for the most accurate trace.

    bash
    # Find the cilium agent pod running on the same node as the destination pod
    DEST_NODE=$(kubectl get pod -n hipstershop $DEST_POD_NAME -o jsonpath='{.spec.nodeName}')
    CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium \
      --field-selector spec.nodeName=$DEST_NODE -o jsonpath='{.items[0].metadata.name}')
    
    # Execute the policy trace command inside the cilium agent
    kubectl exec -it -n kube-system $CILIUM_POD -- \
      cilium policy trace \
        --src-identity $SOURCE_IDENTITY \
        --dst-identity $DEST_IDENTITY \
        --dport 50051

    Expected Output & Analysis:

    text
    -> Verdict: Denied
      Source Identity: 43128 -> hipstershop/currencyservice
      Destination Identity: 12945 -> hipstershop/paymentservice
      Traffic: TCP port 50051
      Policy: 
        hipstershop/deny-all-ingress (Ingress)
          Enforced: Yes
          Rule:       (no rules matched)

    This output is the ultimate debugging tool. It tells you:

  • The final verdict is Denied.
  • It explicitly names the policy responsible: hipstershop/deny-all-ingress.
  • It shows that the denial occurred because no rule in that policy matched the traffic, so the default-deny of the empty ingress section applied.

    This level of introspection allows you to resolve complex, multi-policy interaction issues with confidence, without having to guess which policy is at fault. If you want to see exactly what the agent has loaded, you can dump its policy repository, as sketched below.
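    To cross-check the trace against the agent's actual state, the same agent pod can dump its compiled policy and per-endpoint enforcement status:

    bash
    # Dump the policy repository exactly as the agent has compiled it (JSON output)
    kubectl exec -n kube-system $CILIUM_POD -- cilium policy get
    
    # Show per-endpoint policy enforcement status on this node
    kubectl exec -n kube-system $CILIUM_POD -- cilium endpoint list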

    Conclusion: The Future is Kernel-Native

    For senior engineers optimizing for performance, reliability, and security in Kubernetes, moving observability out of user-space sidecars and into the kernel via eBPF is a logical and powerful evolution. Cilium provides a mature, production-ready implementation of this vision.

    By leveraging kernel-native data collection, we achieve:

  • Reduced Latency: Eliminating the user-space proxy detour for every network packet significantly lowers communication overhead.
  • Lower Resource Consumption: Fewer moving parts (no sidecar per pod) means less CPU and memory usage across the cluster.
  • Simplified Architecture: The operational burden of injecting, managing, and updating sidecars is removed.
  • Deep, Context-Aware Visibility: Combining L3/L4 flow data with L7 protocol parsing and identity-based context provides richer insights than IP-based tools.
    While the sidecar model still holds value for complex traffic-shifting and routing use cases, for pure observability the performance and efficiency of eBPF are undeniable. As eBPF's capabilities continue to expand with projects like Tetragon for runtime security, the kernel is solidifying its place as the next frontier for cloud-native observability and security.
