Kernel-Level Service Mesh Observability with eBPF, Cilium, and Hubble

Goh Ling Yong

The Inescapable Overhead of Sidecar Proxies

For years, the service mesh paradigm, championed by tools like Istio and Linkerd, has been synonymous with the sidecar proxy. Injecting an L7-aware proxy like Envoy into every application pod provided a powerful, transparent mechanism for traffic management, security, and observability. Senior engineers understand the benefits: mTLS, advanced routing, retries, and detailed metrics without modifying application code. However, we also bear the scars of its operational costs.

Every sidecar introduces a non-trivial resource tax. A modest 100m CPU and 128Mi memory per sidecar, multiplied across a thousand pods, adds up to 100 vCPUs and roughly 125 GiB of memory dedicated solely to the mesh infrastructure. More critically, each packet must traverse the network stack multiple times: from the kernel to the sidecar's user-space process, and then back to the kernel before reaching the application's user-space process. This round trip adds measurable latency, a critical concern for high-performance, low-latency services. The complexity of managing sidecar lifecycle, injection, and versioning further compounds the operational burden.

This article is for engineers who have faced these challenges and are seeking a more efficient, performant alternative. We will explore how eBPF, through the lens of Cilium, fundamentally alters the service mesh data plane, moving observability from a costly user-space abstraction to an efficient, native kernel-level function.

Architectural Divergence: eBPF vs. Sidecar Data Path

To appreciate the shift, let's contrast the journey of a network packet in both architectures.

Sidecar Proxy (e.g., Istio with Envoy):

  • Egress: An application in Pod A sends a request to Service B.
  • The request hits the pod's network namespace and is intercepted by iptables rules injected by the mesh.
  • iptables redirects the packet to the Envoy sidecar process within Pod A.
    • The packet traverses from kernel-space to user-space to reach Envoy.
    • Envoy applies its L7 logic (routing, metrics, tracing headers, mTLS encryption).
  • Envoy sends the packet back to the kernel to be routed to Pod B.
  • Ingress: The packet arrives at Pod B's network namespace.
  • iptables rules again intercept the packet, redirecting it to Pod B's Envoy sidecar.
    • The packet again moves from kernel-space to user-space.
    • Envoy decrypts the mTLS, validates the request, and gathers metrics.
  • Finally, Envoy forwards the packet via the loopback interface to the application process in Pod B.
  • This path involves at least four transitions between kernel space and user space and two full proxy hops, adding latency at each step.

    eBPF Data Plane (Cilium):

  • Egress: An application in Pod A sends a request to Service B.
  • The sendmsg() or send() syscall is invoked. An eBPF program attached to the socket hook (cgroup/connect4 or similar) intercepts the call within the kernel.
    • Cilium's eBPF program understands Kubernetes services and endpoints via eBPF maps populated by the Cilium agent.
  • It performs service-to-backend translation directly in the kernel, selecting the destination Pod B's IP.
    • If network policies are in place, they are evaluated and enforced at this stage.
  • The packet is sent directly from Pod A's kernel network stack to Pod B's, without ever entering a user-space proxy.
  • Ingress: The packet arrives at the network interface on Pod B's node. An eBPF program attached at the Traffic Control (TC) layer processes the packet, enforces ingress policies, and delivers it directly to the application's socket.
  • All observability data (source/destination IP, port, identity, verdict, L7 metadata for supported protocols) is collected by these eBPF programs and written to a performant, per-CPU eBPF ring buffer, from which the Cilium agent and Hubble can consume it with minimal overhead.

    This fundamental difference eliminates the per-packet latency tax and resource overhead of sidecars, forming the basis of our exploration.
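
    To see this in-kernel state directly, the Cilium agent ships a CLI for dumping the underlying eBPF maps. The commands below are read-only; ds/cilium assumes the default agent DaemonSet name in kube-system (substitute a specific cilium pod name if your kubectl version does not resolve workload references):

    bash
    # Service-to-backend translation table used for in-kernel load balancing
    kubectl -n kube-system exec ds/cilium -- cilium bpf lb list

    # Connection-tracking entries maintained by the eBPF datapath
    kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global

    # All eBPF maps managed by the agent on this node
    kubectl -n kube-system exec ds/cilium -- cilium bpf map list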


    Production-Grade Cilium Configuration for Observability

    A default Cilium installation provides basic CNI functionality. To unlock its full observability potential for a production environment, we must enable and tune specific features. Below is an advanced Helm values.yaml snippet, followed by a breakdown of why each configuration is critical.

    yaml
    # values-production-observability.yaml
    
    # Enable Hubble for deep flow visibility
    hubble:
      enabled: true
      # Listen on all interfaces for hubble CLI access
      listenAddress: ":4244"
      # Enable TLS for Hubble server/CLI communication
      tls:
        auto:
          method: helm
      # Deploy Hubble Relay for cluster-wide visibility from a single endpoint
      relay:
        enabled: true
      # Deploy the Hubble UI for visualization and exploration
      ui:
        enabled: true
      # Enable Hubble's Prometheus metrics endpoint
      metrics:
        # OpenMetrics output is required for exemplar support
        enableOpenMetrics: true
        enabled:
          - "dns:query;ignoreAAAA"
          - "drop"
          - "tcp"
          - "flow"
          - "port-distribution"
          - "icmp"
          - "httpV2:exemplars=true;labels=path,method,status"
    
    # Core CNI and eBPF settings
    # Replace kube-proxy entirely for full eBPF data path visibility
    kubeProxyReplacement: strict
    
    bpf:
      # Enable BPF masquerading for traffic leaving the cluster
      masquerade: true
      # Pre-allocate eBPF maps for performance and stability under load
      # Sizing depends on cluster scale. Monitor map pressure.
      preallocateMaps: true
    
    # L7 visibility is driven by L7 rules in CiliumNetworkPolicy; the default
    # enforcement mode only affects endpoints selected by a policy
    policyEnforcementMode: "default"
    
    # Enable Prometheus metrics for the Cilium agent
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true # If using Prometheus Operator

    Dissecting the Configuration:

    * hubble.relay.enabled: true: In a multi-node cluster, each Cilium agent has its own Hubble instance observing local flows. Hubble Relay aggregates these distributed streams into a single, cluster-wide gRPC API. This is essential for getting a complete picture of traffic without having to query each node's agent individually.

    * kubeProxyReplacement: strict: This is arguably the most critical setting for a pure eBPF data path. It instructs Cilium to completely take over ClusterIP, NodePort, and LoadBalancer service implementation using eBPF programs and maps. By removing kube-proxy (and its underlying iptables or IPVS rules), we ensure that all service traffic is processed by Cilium's eBPF programs, making it visible to Hubble. Without this, traffic might bypass Cilium's hooks, leading to blind spots in observability.

    * bpf.preallocateMaps: true: eBPF maps are key-value stores in the kernel used by Cilium to maintain state (e.g., service endpoints, connection tracking entries, policy identities). By default, some maps grow dynamically. Pre-allocating them on agent startup prevents potential allocation failures under heavy load or memory pressure, leading to a more stable and predictable data plane. Monitoring map pressure (cilium bpf map list) is a key operational task.

    * policyEnforcementMode: "default": While this sounds like a purely security-related setting, it underpins L7 observability. In the default mode, only endpoints selected by a CiliumNetworkPolicy have policy applied, and it is the L7 rules in those policies (HTTP, Kafka, gRPC, DNS) that instruct Cilium to parse a protocol. That parsing is handled by a shared, per-node proxy managed by Cilium rather than a sidecar in every pod, and it populates Hubble flows with rich L7 metadata (e.g., HTTP method, path, status code).

    * hubble.metrics.enabled: This configures Hubble's own Prometheus metrics endpoint. The httpV2 line is particularly powerful: it enables detailed HTTP metrics, and exemplars=true links these metrics back to specific traces in systems like Grafana Tempo, providing a seamless jump from a metric spike to the exact flows that caused it.

    Applying this configuration via Helm provides a robust foundation for the advanced debugging and auditing scenarios that follow.

    bash
    helm install cilium cilium/cilium --version 1.15.1 \
      -n kube-system \
      -f values-production-observability.yaml
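
    Once the install settles, it is worth confirming that the settings above actually took effect. A minimal verification pass, assuming the cilium and hubble CLIs are installed locally:

    bash
    # Overall health of the agent, operator, Hubble Relay, and UI
    cilium status --wait

    # Confirm kube-proxy replacement is active on an agent
    kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement

    # Tunnel to Hubble Relay (localhost:4245 by default), then check cluster-wide flow visibility
    cilium hubble port-forward &
    hubble status
    hubble observe --last 20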

    Advanced Debugging with the Hubble CLI

    The Hubble UI is excellent for discovery, but the hubble CLI is the power tool for automation, scripting, and deep, repeatable analysis. We'll walk through a common, complex production scenario.

    Scenario 1: Isolating Latency in a Microservice Call Chain

    Setup:

    * A frontend service calls a product-api service.

    * The product-api service calls a database service.

    * Users are reporting intermittent slowness on the frontend.

    * All services are deployed in the production namespace.

    Problem: The overall request latency is high, but we don't know if the bottleneck is frontend -> product-api network latency, product-api processing time, or product-api -> database latency.

    Solution using Hubble CLI:

    First, let's observe the flows between the frontend and the product-api. We'll target a specific pod to reduce noise.

    bash
    # 1. Get the pod names
    FRONTEND_POD=$(kubectl get pods -n production -l app=frontend -o jsonpath='{.items[0].metadata.name}')
    PRODUCT_API_POD=$(kubectl get pods -n production -l app=product-api -o jsonpath='{.items[0].metadata.name}')
    
    # 2. Observe flows from frontend to product-api, focusing on HTTP
    # The -o json flag emits newline-delimited JSON: one flow object per line
    hubble observe -n production --from-pod ${FRONTEND_POD} --to-pod ${PRODUCT_API_POD} --protocol http -o json

    Sample JSON Output and Analysis:

    json
    {"flow":{"time":"2023-10-27T10:30:01.123456789Z","verdict":"FORWARDED","source":{"identity":123,"namespace":"production","pod_name":"frontend-xyz"},"destination":{"identity":456,"namespace":"production","pod_name":"product-api-abc"},"Type":2,"l4":{"TCP":{"source_port":54321,"destination_port":8080}},"l7":{"http":{"method":"GET","url":"/products/123","status":200}}},"node_name":"node-1"}
    {"flow":{"time":"2023-10-27T10:30:01.567890123Z","verdict":"FORWARDED","source":{"identity":123,"namespace":"production","pod_name":"frontend-xyz"},"destination":{"identity":456,"namespace":"production","pod_name":"product-api-abc"},"Type":2,"l4":{"TCP":{"source_port":54322,"destination_port":8080}},"l7":{"http":{"method":"GET","url":"/products/456","status":503}}},"node_name":"node-1"}

    From this, we can immediately see:

    • The exact timestamps of each request and response flow.
    • The L7 details: GET /products/123 succeeded (status: 200), but GET /products/456 failed (status: 503).
    • The source and destination identities, confirming the traffic path.

    But this doesn't show us the latency. For that, we need to look at the TCP-level flows. Let's refine the query to look for slow connections by examining the time delta between TCP flags.

    bash
    # Observe all TCP flows for this connection, not just the L7-parsed ones
    hubble observe -n production --from-pod ${FRONTEND_POD} --to-pod ${PRODUCT_API_POD} --to-port 8080 -o json | jq -c '.flow | {time: .time, verdict: .verdict, flags: .l4.TCP.flags}'

    Interpreting TCP Flag Timestamps for Latency:

    By analyzing the timestamps of flows with specific TCP flags, we can break down latency:

  • SYN -> SYN_ACK: This delta represents the network round-trip time (RTT) and the server's TCP stack processing time for establishing a connection. A high value here points to network congestion or a heavily loaded destination node.
  • Request Packet (PSH, ACK) -> Response Packet (PSH, ACK): This is the application processing time. It's the time from when the last packet of the request was sent until the first packet of the response was received. This is our product-api's internal processing time.
  • FIN -> FIN_ACK: This measures the time to tear down the connection.
    If we see a large gap (e.g., >500ms) between the request PSH, ACK and the response PSH, ACK, the problem is likely within the product-api's business logic. We can then repeat the same analysis for the product-api -> database connection to further isolate the issue.
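
    A rough way to perform this analysis with jq is sketched below. It assumes the GetFlowsResponse JSON layout shown earlier (a flow object carrying l4.TCP.flags); exact field names can vary slightly between Cilium versions, so treat this as a starting point rather than a finished tool:

    bash
    # Print timestamp, TCP flags, source pod, and verdict for recent flows touching
    # product-api:8080, so the SYN -> SYN-ACK and request -> response gaps can be read off.
    hubble observe -n production --pod ${PRODUCT_API_POD} --port 8080 --last 200 -o json \
      | jq -r '.flow
          | select(.l4.TCP.flags != null)
          | [.time, (.l4.TCP.flags | keys | sort | join("+")), .source.pod_name, .verdict]
          | @tsv'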

    Advanced Edge Case: HTTP/2 and gRPC

    For multiplexed protocols like HTTP/2 or gRPC, a single TCP connection carries multiple streams. In this case, analyzing TCP flags is insufficient. This is where Cilium's L7 parsing shines. The httpV2 metrics we enabled earlier will provide per-stream latency histograms, which can be queried via Prometheus, pinpointing latency for specific RPC methods, not just the underlying connection.
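
    As a concrete sketch, the httpV2 handler exposes request counters and duration histograms (commonly hubble_http_requests_total and hubble_http_request_duration_seconds; verify the exact metric and label names against your own /metrics output). A P95 latency query against Prometheus' HTTP API might look like the following, with the Prometheus hostname being a placeholder:

    bash
    # P95 HTTP request latency over the last 5 minutes, as recorded by Hubble's L7 parsing.
    # Add the labels you configured (path, method, status) to the "by" clause as needed.
    curl -sG 'http://prometheus.example.internal:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(hubble_http_request_duration_seconds_bucket[5m])))'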


    Real-Time Security Auditing with L7 Policies

    Observability isn't just for performance; it's a critical tool for security. Cilium's network policies can enforce rules at L3/L4 and L7. Hubble allows us to observe these policies in action and audit why traffic is being dropped.

    Scenario 2: Enforcing and Auditing Granular API Access

    Setup:

    * A billing-service exposes two endpoints: POST /charge (for processing payments) and GET /status/{id} (for checking payment status).

    * An internal payment-processor service is allowed to call POST /charge.

    * A support-dashboard service should only be allowed to call GET /status/{id}.

    Implementation: The CiliumNetworkPolicy

    We define a policy that applies to the billing-service and specifies a default-deny posture, with explicit allow rules for L7 paths and methods.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "billing-api-policy"
      namespace: "production"
    spec:
      endpointSelector:
        matchLabels:
          app: billing-service
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: payment-processor
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "POST"
              path: "/charge"
      - fromEndpoints:
        - matchLabels:
            app: support-dashboard
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/status/.*"

    Auditing Policy Enforcement with Hubble:

    Now, imagine a compromised or misconfigured support-dashboard pod attempts to access the charging endpoint.

    bash
    # From the support-dashboard pod:
    kubectl exec -it ${SUPPORT_DASHBOARD_POD} -n production -- curl -X POST http://billing-service:8080/charge -d '{"amount": 1000}'
    # Because the HTTP rules are enforced by Cilium's L7 proxy, this request is rejected at the
    # application layer; the client typically receives a 403 Access Denied response.

    To diagnose this, we use Hubble to find the dropped flow.

    bash
    # Look for dropped packets originating from the support-dashboard
    hubble observe -n production --from-pod ${SUPPORT_DASHBOARD_POD} --verdict DROPPED -o json

    Output and Analysis:

    json
    {
      "flow": {
        "time": "2023-10-27T11:00:05.123Z",
        "verdict": "DROPPED",
        "drop_reason_desc": "POLICY_DENIED",
        "source": {
          "identity": 789,
          "namespace": "production",
          "pod_name": "support-dashboard-def"
        },
        "destination": {
          "identity": 901,
          "namespace": "production",
          "pod_name": "billing-service-ghi"
        },
        "Type": 2,
        "l4": {
          "TCP": {
            "destination_port": 8080
          }
        },
        "l7": {
          "http": {
            "method": "POST",
            "url": "/charge"
          }
        }
      },
      "node_name": "node-2"
    }

    This output is an unambiguous audit log. It shows:

    * verdict: DROPPED: The packet was not delivered.

    * drop_reason_desc: POLICY_DENIED: The drop was due to a network policy.

    * The full L7 context (method: POST, url: /charge) shows exactly why the policy denied the request.

    This immediate, detailed feedback loop is invaluable for both security teams auditing access patterns and for developers debugging connectivity issues in a zero-trust environment.
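
    The same flow data also lends itself to a quick, cluster-wide audit. A small sketch that summarizes recent drops by reason and source pod, reusing the JSON field names from the output above:

    bash
    # Count recent drops in the production namespace, grouped by drop reason and source pod
    hubble observe -n production --verdict DROPPED --last 1000 -o json \
      | jq -r '.flow | [.drop_reason_desc, .source.pod_name] | @tsv' \
      | sort | uniq -c | sort -rn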


    Performance and Scaling Considerations

    While eBPF is highly performant, a production deployment requires understanding its operational characteristics and potential bottlenecks.

    1. eBPF Map Sizing and Pressure

    Cilium's state is held in eBPF maps in the kernel. The most important ones are the connection tracking (CT) table, the NAT table, and the policy map.

    * Problem: If a CT map fills up, the kernel can no longer track new connections, and new flows will be dropped. This can cause catastrophic, hard-to-debug application failures.

    * Monitoring: The cilium status command provides a high-level overview. For deep analysis, inspect map pressure directly:

    bash
        # Exec into a cilium agent pod
        kubectl exec -it -n kube-system cilium-xxxxx -- cilium bpf ct list global
        # Check the size vs. capacity of maps
        kubectl exec -it -n kube-system cilium-xxxxx -- cilium bpf map list

    * Mitigation: Tune map sizes via the Helm chart's bpf.* values (for example bpf.lbMapMax for the load-balancer map, the connection-tracking limits such as bpf.ctTcpMax, or bpf.mapDynamicSizeRatio to size maps as a fraction of node memory). Sizing depends on the number of nodes, pods, services, and expected connection rate. Start with defaults, monitor pressure under load, and adjust proactively.
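
    When tuning is required, the change is just another Helm upgrade. A hedged example follows; double-check the value names against the chart version you run, and note that agents must be restarted before new map sizes take effect:

    bash
    # Example: raise the load-balancer map limit and size several maps as a fraction of node memory
    helm upgrade cilium cilium/cilium --version 1.15.1 -n kube-system \
      --reuse-values \
      --set bpf.lbMapMax=131072 \
      --set bpf.mapDynamicSizeRatio=0.005

    # Recreate the maps with the new sizes
    kubectl -n kube-system rollout restart ds/cilium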

    2. CPU Overhead: Agent vs. Sidecar

    The performance argument for eBPF is not just about latency but also CPU efficiency.

    * Sidecar Model: CPU cost scales linearly with the number of pods: N pods × M vCPU per sidecar = total mesh CPU.

    * Cilium Model: CPU cost scales with the number of nodes and the volume of traffic. The cilium-agent DaemonSet consumes a relatively fixed amount of CPU per node, plus CPU cycles consumed by the eBPF programs in the kernel. The kernel-space execution is extremely efficient.

    Benchmark Snapshot (Conceptual):

    | Metric | 1000 Pods, Istio (0.1 vCPU/sidecar) | 1000 Pods, Cilium (10 nodes, 0.5 vCPU/agent) |
    | --- | --- | --- |
    | Mesh CPU Overhead | 100 vCPUs | ~5 vCPUs (plus kernel time) |
    | Mesh Memory | 125 GiB (128Mi/sidecar) | ~5 GiB (512Mi/agent) |
    | P99 Latency Added | 5-15ms | <1ms |

    These numbers are illustrative but reflect the order-of-magnitude difference in resource footprint and performance impact.
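
    These figures are easy to sanity-check in your own cluster. The commands below assume metrics-server is available for kubectl top and that istio-proxy is the sidecar container name, as in a default Istio injection:

    bash
    # Actual per-node cost of the Cilium agents
    kubectl top pods -n kube-system -l k8s-app=cilium

    # For a sidecar mesh, count the injected proxies whose requests you are paying for
    kubectl get pods -A -o json \
      | jq '[.items[].spec.containers[].name | select(. == "istio-proxy")] | length'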

    3. Hubble Data Volume and Export

    Hubble can generate a massive volume of flow data. Storing and querying this efficiently is a systems design challenge.

    * Problem: Uncontrolled flow logging can overwhelm the Cilium agent, Hubble Relay, and any downstream storage system.

    * Mitigation Strategies:

    * Sampling and rate limiting: Do not ship every single flow downstream. Hubble itself only retains a fixed-size, per-node ring buffer of recent flows; the real cost is in the export path, so sample or rate-limit there, especially for high-throughput services.

    * Filtering: Configure Hubble to ignore certain flows (e.g., health checks) that generate a lot of noise.

    * Aggregation: Rely on the Prometheus metrics endpoint (hubble-metrics) for aggregated, long-term data. Use the raw flow logs from the hubble observe CLI or gRPC API primarily for targeted, real-time debugging.

    * External Export: For long-term retention and complex analytics, configure Hubble to export flows to a dedicated observability backend like Grafana Loki, Elasticsearch, or a data lake. This offloads the storage and query burden from the cluster itself.

    Exporting flows to a logging stack via Hubble Relay's gRPC API is a common pattern: a custom consumer subscribes to the flow stream, formats each flow, and forwards it downstream.
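
    Below is a minimal sketch of that pattern using nothing more than the hubble CLI as the consumer. It assumes Hubble Relay is reachable on localhost:4245 (for example via cilium hubble port-forward); the HTTP ingest endpoint is purely illustrative, since real backends such as Loki or Elasticsearch have their own ingestion APIs and need batching and field mapping:

    bash
    # Stream flows from Hubble Relay as newline-delimited JSON and forward each one to a collector.
    # A production exporter would batch, retry, and translate flows into the backend's native format.
    hubble observe --server localhost:4245 --follow -o json \
      | while IFS= read -r flow; do
          curl -s -X POST -H 'Content-Type: application/json' \
            --data "${flow}" 'http://log-ingest.example.internal/flows' > /dev/null
        done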

    Conclusion: Beyond the Sidecar

    The move from sidecar-based service meshes to eBPF-powered data planes represents a significant architectural evolution in cloud-native infrastructure. By leveraging Cilium and Hubble, senior engineers can achieve a level of observability that is both deeper and more performant than what was possible with previous generations of tools.

    We have moved beyond simple metrics to kernel-level flow analysis, enabling precise latency debugging, real-time security auditing, and a drastic reduction in resource overhead. This approach is not a panacea—complex, application-specific traffic policies like advanced retries or circuit breaking may still benefit from an L7 proxy. However, for the foundational pillars of service mesh—networking, security, and observability—the eBPF-based, sidecar-less model presents a compelling and production-proven path forward. It trades the complexity of user-space proxies for the operational discipline of managing kernel-level components, a trade-off that yields profound benefits in performance, efficiency, and visibility.
