eBPF Observability in K8s: Cilium & Hubble for Production

Goh Ling Yong

Beyond the Sidecar: The Kernel-Level Observability Revolution with eBPF

For years, the de facto standard for achieving deep observability in Kubernetes has been the service mesh, typically implemented via a sidecar proxy like Envoy (used by Istio and others). While powerful, this model introduces significant operational and performance overhead: every pod gets a dedicated user-space proxy, increasing resource consumption, adding latency to the request path, and complicating TLS and certificate management. Senior engineers managing large-scale clusters know this pain well. The trade-off between visibility and performance has always been a contentious point.

eBPF (extended Berkeley Packet Filter) fundamentally changes this equation. By running sandboxed programs directly within the Linux kernel, eBPF allows us to implement networking, security, and observability logic without modifying application code or injecting user-space proxies. Cilium leverages eBPF to provide a CNI (Container Network Interface) that operates at the kernel level, offering unparalleled performance and visibility.

This article is not an introduction to eBPF or Cilium. It's a production-focused guide for engineers who understand the fundamentals of Kubernetes networking and are evaluating or implementing Cilium for its advanced observability capabilities. We will dissect the practical implementation of Cilium and its observability component, Hubble, focusing on production-ready configurations, advanced debugging techniques, and performance tuning considerations.

The Technical Advantage: Kernel-Space vs. User-Space

Let's quantify the difference. In a sidecar model:

  • A packet arrives at the node's network interface.
  • The kernel's network stack (iptables or IPVS) redirects it to the pod's network namespace.
  • The packet is routed to the Envoy sidecar proxy listening on a specific port.
  • Envoy (in user-space) processes the packet, gathers metrics, enforces policy, and then forwards it to the application container, again via the kernel's loopback interface.

This context switching between kernel-space and user-space for every packet adds latency and consumes CPU cycles.

With Cilium's eBPF datapath:

  • A packet arrives at the node's network interface.
  • An eBPF program attached to the network driver (XDP) or Traffic Control (TC) hook point inspects the packet directly in kernel-space.
  • The eBPF program enforces network policy, gathers metrics, and forwards the packet directly to the destination pod's network interface, bypassing much of the host's traditional networking stack (like iptables).

This process eliminates multiple context switches and user-space hops, resulting in significantly lower latency and reduced CPU overhead, especially at high throughput.
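
You can see this datapath first-hand on any Cilium node. As a quick sketch (assuming bpftool is installed on the host), list the eBPF programs Cilium has attached and the maps it has pinned:

```bash
# List eBPF programs attached to XDP/TC hooks on this node
bpftool net show

# Cilium pins its maps (conntrack, policy, etc.) under the BPF filesystem
ls /sys/fs/bpf/tc/globals/
```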


Section 1: Deploying a Production-Ready Cilium & Hubble Stack

Deploying Cilium via Helm is straightforward, but a production deployment requires careful configuration. A default helm install is insufficient. Here's a breakdown of a robust values.yaml for enabling a full observability stack.

Prerequisites

* A Kubernetes cluster (v1.23+ recommended).
* A Linux kernel with full eBPF support (5.10+ is ideal for all features).
* Helm v3+.

Advanced Helm Configuration (`values.yaml`)

This configuration enables Hubble for UI and CLI access, exposes Prometheus metrics for scraping, and configures the operator for high availability.

```yaml
# values-production.yaml

# Use a stable, recent version of Cilium
image:
  tag: "v1.15.1"

# Ensure the BPF filesystem is mounted (default mount point: /sys/fs/bpf)
bpf:
  autoMount:
    enabled: true

# Operator configuration for HA and Prometheus integration
operator:
  replicas: 2
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true # For Prometheus Operator

# Enable Hubble for deep flow visibility
hubble:
  enabled: true
  relay:
    enabled: true
    prometheus:
      serviceMonitor:
        enabled: true
  ui:
    enabled: true
  metrics:
    enabled:
      - "dns"
      - "drop"
      - "tcp"
      - "flow"
      - "port-distribution"
      - "icmp"
      # httpV2 includes method/status labels by itself; the labels option
      # adds source/destination context to each sample
      - "httpV2:exemplars=true;labels=source_namespace,source_pod,destination_namespace,destination_pod,traffic_direction"

# Enable Prometheus scraping on Cilium agents
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true

# Crucial for policy visibility. In the default mode ("default"), endpoints
# not selected by any policy allow all traffic; "always" enforces everywhere,
# giving better visibility into drops.
policyEnforcementMode: "always"

# Note: L7 (HTTP/gRPC) visibility is not a global Helm toggle. It is enabled
# per-flow by L7 rules in network policies, which redirect matching traffic
# through Cilium's proxy. This adds some overhead; see Section 4 for tuning.

# For cloud environments, set the native routing CIDR
# Example for AWS
# ipam:
#   operator:
#     clusterPoolIPv4PodCIDR: "10.0.0.0/16"

# Security context for production hardening
securityContext:
  privileged: false

# For clusters running without kube-proxy, for maximum performance
# kubeProxyReplacement: true   # ("strict" on older 1.x releases)
```
Deployment Steps

1. Add the Cilium Helm repository:

```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
```

2. Install Cilium with the production values:

```bash
helm install cilium cilium/cilium --version 1.15.1 \
  -n kube-system \
  -f values-production.yaml
```

3. Verify the installation (this uses the cilium CLI):

```bash
cilium status --wait
```

You should see all components healthy and the KubeProxyReplacement status reflecting your configuration.
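
As an optional final check, Cilium ships an end-to-end connectivity test suite that exercises the datapath, policy enforcement, and (if enabled) kube-proxy replacement. It deploys temporary workloads into a dedicated namespace:

```bash
# Runs a suite of connectivity checks; requires the cilium CLI
cilium connectivity test
```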


Section 2: Deep Network Flow Analysis with Hubble

With Cilium running, Hubble becomes our microscope into the cluster's network traffic. We'll deploy a sample microservices application to demonstrate its power.

Sample Application

Let's use a simple three-tier app: frontend -> api-service -> db-service.

```yaml
# sample-app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: demo-app
  labels:
    app: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: nginx # Simple placeholder
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: demo-app
  labels:
    app: api-service
    tier: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        tier: backend
    spec:
      containers:
      - name: api-service
        image: kennethreitz/httpbin # A useful HTTP testing service
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: demo-app
spec:
  selector:
    app: api-service
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-service
  namespace: demo-app
  labels:
    app: db-service
    tier: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db-service
  template:
    metadata:
      labels:
        app: db-service
        tier: database
    spec:
      containers:
      - name: db-service
        image: redis
```
Apply this manifest: `kubectl apply -f sample-app.yaml`
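
Once the pods are running, you can confirm that Cilium has assigned each workload a security identity. A CiliumEndpoint object is created automatically per pod:

```bash
kubectl get pods -n demo-app
# Each pod gets a CiliumEndpoint showing its identity and enforcement status
kubectl get ciliumendpoints -n demo-app
```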

Advanced Hubble CLI Usage

First, port-forward the Hubble Relay service (the relay Service exposes port 80; the hubble CLI expects localhost:4245 by default):

```bash
kubectl port-forward -n kube-system svc/hubble-relay 4245:80
```

Now, use the hubble CLI. Let's explore some advanced queries.

• Observe all traffic within the demo-app namespace, including L7 details:

```bash
hubble observe --namespace demo-app --protocol http -o json | jq
```

This provides raw, detailed JSON output for every observed HTTP request, perfect for scripting or further analysis. Note that HTTP-level detail only appears for traffic that is redirected through Cilium's proxy by an L7 policy rule, like the one we apply next; L3/L4 flows are always visible.
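
If you want something more digestible than full JSON, you can post-process with jq. This is a sketch: the field paths below follow Hubble's Flow schema (.l7.http.*), but depending on your CLI version flows may be nested under a top-level flow key, so adjust accordingly:

```bash
# Tally HTTP responses by status/method/URL (field paths assumed; see above)
hubble observe --namespace demo-app --protocol http -o json \
  | jq -r 'select(.l7.http.code != null)
           | "\(.l7.http.code) \(.l7.http.method) \(.l7.http.url)"' \
  | sort | uniq -c | sort -rn
```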

• Identify dropped packets and their reasons:

Let's apply a restrictive network policy that only allows frontend to talk to api-service, and only via GET /get:

```yaml
# netpol.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "api-allow-frontend"
  namespace: demo-app
spec:
  endpointSelector:
    matchLabels:
      app: api-service
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/get"
```
Apply it: `kubectl apply -f netpol.yaml`

Now, try to POST from frontend to api-service:

```bash
# Exec into the frontend pod
kubectl exec -it -n demo-app $(kubectl get pods -n demo-app -l app=frontend -o name) -- bash

# Inside the pod, make an allowed request
curl http://api-service/get
# Make a disallowed request
curl -X POST http://api-service/post
```

Because this policy is enforced at L7, the second curl does not simply hang: Cilium's Envoy proxy rejects it with an HTTP 403 (Access denied). Only L3/L4 drops manifest as timeouts.

Now, let's debug this with Hubble:

```bash
hubble observe --namespace demo-app --verdict DROPPED
```

Expected output (abridged; exact formatting varies by Hubble version):

```text
Feb 26 15:30:10.123: demo-app/frontend-xyz:43210 -> demo-app/api-service-qrs:80 http-request DROPPED (HTTP/1.1 POST http://api-service/post)
```

This instantly tells you the source, destination, protocol, and exactly which request was denied. This is invaluable for debugging complex network policies.
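
To narrow an investigation, hubble observe also supports HTTP-level filter flags in recent releases (check `hubble observe --help` for your version); for example:

```bash
# Show only denied POSTs in the namespace (flags assumed present in
# recent Hubble releases)
hubble observe --namespace demo-app --verdict DROPPED --http-method POST
```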

Using the Hubble UI for Visual Analysis

Open the Hubble UI (the cilium CLI sets up the port-forward and opens a browser for you):

```bash
cilium hubble ui
```

This shows a live service map of your cluster. Here's what to look for:

* Service Map: You'll see nodes for frontend, api-service, and db-service. Allowed traffic (the GET request) shows as a normal edge between services. The denied POST request is highlighted in red, making it immediately obvious where connectivity issues lie.

* Flow Inspection: Clicking on the edge between frontend and api-service shows a list of recent flows. You can filter for DROPPED verdicts and inspect the L7 details of the denied request, including the HTTP method (POST) and path (/post) that violated the policy.


Section 3: Production Metrics with Prometheus and Grafana

While Hubble is excellent for real-time debugging, Prometheus is the tool for long-term monitoring, alerting, and trend analysis. Cilium exposes a rich set of metrics.

Key Metrics to Monitor

Assuming you have a Prometheus Operator stack, the ServiceMonitor objects we enabled will automatically start scraping metrics. Here are the most critical ones:

* cilium_drop_count_total{reason="Policy denied"}: The number one metric for network policy health. A sudden increase indicates misconfiguration or an unauthorized access attempt. You should alert on this (see the example rule after this list).

* cilium_forward_count_total: A packet counter for all forwarded traffic (its companion, cilium_forward_bytes_total, counts bytes). Useful for understanding traffic volume between services.

* hubble_flows_processed_total: Monitors the health of Hubble itself. A drop in its rate could indicate an issue with the observability pipeline.

* hubble_http_requests_total: L7 metric giving you HTTP request counts, filterable by method and status code. This provides a RED (Rate, Errors, Duration) metrics baseline without instrumenting your application.

* cilium_endpoint_regeneration_time_stats_seconds: Tracks how long it takes Cilium to regenerate an endpoint's eBPF programs when its identity or policy changes. Spikes can indicate an overloaded operator or control plane issues.
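
As a starting point for that alert, here is a minimal PrometheusRule sketch. It assumes the Prometheus Operator CRDs are installed; the alert name, namespace, and threshold are illustrative, not prescriptive:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cilium-policy-drops
  namespace: monitoring # wherever your Prometheus Operator watches for rules
spec:
  groups:
  - name: cilium.rules
    rules:
    - alert: CiliumPolicyDenials
      # Sustained policy-denied drops usually mean a misconfigured policy
      # or an unauthorized access attempt
      expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[5m])) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Cilium is denying traffic by network policy"
```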

Building a Production Grafana Dashboard

Here are some PromQL queries for a powerful Cilium dashboard:

• Top Drop Reasons:

```promql
topk(10, sum(rate(cilium_drop_count_total[5m])) by (reason, direction))
```

This gives you an immediate overview of why packets are being dropped across the cluster. Note that the agent-level drop counter is not namespace-aware; for per-namespace breakdowns, use Hubble's hubble_drop_total with context labels configured.

• HTTP Error Rate (5xx) Per Service:

```promql
sum(rate(hubble_http_requests_total{status=~"5.."}[5m])) by (destination_namespace, destination_pod)
/
sum(rate(hubble_http_requests_total[5m])) by (destination_namespace, destination_pod)
```

This query leverages Hubble's L7 metrics to calculate a crucial application-level SLO without any application-side instrumentation.

• L7 Policy Verdicts:

```promql
sum(rate(cilium_policy_l7_total[5m])) by (rule)
```

This shows how actively the L7 proxy is receiving, forwarding, and denying requests, which can be useful for identifying unexpectedly permissive or restrictive rules.


Section 4: Edge Cases & Performance Tuning

This is where senior-level operational knowledge comes into play. Running Cilium in production requires understanding its boundaries and how to tune it.

Edge Case 1: Handling Unsupported L7 Protocols (e.g., Redis)

Cilium's L7 policy engine understands HTTP, gRPC, Kafka, and DNS, but not arbitrary protocols. What about our db-service running Redis?

* Problem: You cannot write a CiliumNetworkPolicy that says "allow api-service to run the SET command but not FLUSHALL."

* Solution: Fall back to L3/L4 policies. You can restrict access to the Redis pod on its port (6379) to only the api-service pod. This is less granular but still provides essential network segmentation.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "db-allow-api-service"
  namespace: demo-app
spec:
  endpointSelector:
    matchLabels:
      app: db-service
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: api-service
    toPorts:
    - ports:
      - port: "6379"
        protocol: TCP
```
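
A quick way to verify the segmentation is to watch traffic arriving at the database with Hubble:

```bash
# With the policy applied, only api-service flows should show FORWARDED;
# anything else should appear as DROPPED (Policy denied)
hubble observe --namespace demo-app --to-label app=db-service
```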

Edge Case 2: The Overhead of L7 Visibility

L7 visibility is not free: to parse HTTP, Cilium redirects the matching flows through its Envoy proxy, and more event data is shipped from the kernel to the user-space Hubble agent.

* Problem: In a high-throughput cluster, parsing L7 on every flow can consume non-trivial CPU and add latency to the request path.

* Solution: Scope L7 parsing with policy. Only traffic matched by a rule containing an L7 section is redirected through the proxy; everything else stays on the pure eBPF fast path.

In a CiliumNetworkPolicy, an empty HTTP rule enables parsing (and therefore Hubble's L7 visibility) for that specific traffic without restricting it:

```yaml
...
toPorts:
- ports:
  - port: "80"
    protocol: TCP
  rules:
    http: # This enables HTTP parsing for this specific rule
    - {}
```

This targeted approach gives you L7 visibility on critical application endpoints while keeping the overhead minimal for high-volume, internal traffic (e.g., database connections).

Performance Tuning: eBPF Map Sizing

eBPF programs use maps to store state (e.g., connection tracking entries, policy identities). In very large clusters, the default map sizes might be insufficient.

* Symptom: cilium-agent logs report that a BPF map is full, leading to dropped connections or incorrect policy enforcement.

* Tuning Parameters (via Helm values.yaml or the equivalent agent flags):

* bpf.ctTcpMax (--bpf-ct-global-tcp-max): Size of the TCP connection tracking map. Recent versions size it dynamically from node memory (bpf.mapDynamicSizeRatio) unless set explicitly. Increase it for nodes handling a massive number of concurrent TCP connections.

* bpf.natMax (--bpf-nat-global-max): Size of the NAT map. Important for nodes running NodePort services or acting as egress gateways.

* bpf.policyMapMax (--bpf-policy-map-max): Size of the per-endpoint policy maps. In clusters with thousands of services and complex policies, this may need to be increased.

Example Adjustment:

```yaml
# In values.yaml
bpf:
  ctTcpMax: 2000000
  natMax: 131072
```

Tuning these requires careful monitoring of map occupancy, e.g., via cilium bpf ct list global and cilium bpf nat list on the node.
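
Newer Cilium releases also export a map pressure gauge to Prometheus, which is easier to alert on than shelling into nodes. Assuming the metric is present in your version, a simple panel or alert expression might be:

```promql
# Fraction of each BPF map's capacity in use (1.0 = full)
max by (map_name) (cilium_bpf_map_pressure)
```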

Conclusion: A New Paradigm for Cloud-Native Observability

Transitioning from sidecar-based service meshes to an eBPF-powered datapath with Cilium is more than just a performance optimization; it's a paradigm shift in how we approach cloud-native observability. By moving visibility to the kernel, we gain a more efficient, secure, and comprehensive view of our systems.

We've covered the practical steps for a production deployment, moving beyond simple installation to advanced configuration, deep debugging with the Hubble CLI, and integrating with Prometheus for long-term monitoring. More importantly, we've addressed the critical edge cases and performance tuning considerations that separate a proof-of-concept from a production-ready system.

As eBPF continues to mature, we can expect even more capabilities to be pushed into the kernel, from advanced security monitoring with tools like Tetragon to even more efficient load balancing. For senior engineers responsible for the stability, performance, and security of large-scale Kubernetes clusters, mastering eBPF-based tools like Cilium is no longer a niche skill; it's becoming a fundamental component of the modern cloud-native stack.
