eBPF Observability in K8s: Cilium & Hubble for Production
Beyond the Sidecar: The Kernel-Level Observability Revolution with eBPF
For years, the de facto standard for achieving deep observability in Kubernetes has been the service mesh, typically implemented via a sidecar proxy like Envoy (used by Istio and others). While powerful, this model introduces significant operational and performance overhead: every pod gets a dedicated user-space proxy, increasing resource consumption, adding latency to the request path, and complicating TLS and certificate management. Senior engineers managing large-scale clusters know this pain well. The trade-off between visibility and performance has always been a contentious point.
eBPF (extended Berkeley Packet Filter) fundamentally changes this equation. By running sandboxed programs directly within the Linux kernel, eBPF allows us to implement networking, security, and observability logic without modifying application code or injecting user-space proxies. Cilium leverages eBPF to provide a CNI (Container Network Interface) that operates at the kernel level, offering unparalleled performance and visibility.
This article is not an introduction to eBPF or Cilium. It's a production-focused guide for engineers who understand the fundamentals of Kubernetes networking and are evaluating or implementing Cilium for its advanced observability capabilities. We will dissect the practical implementation of Cilium and its observability component, Hubble, focusing on production-ready configurations, advanced debugging techniques, and performance tuning considerations.
The Technical Advantage: Kernel-Space vs. User-Space
Let's quantify the difference. In a sidecar model:
- A packet arrives at the node's network interface.
- The kernel's networking stack (`iptables` or IPVS) redirects it to the pod's network namespace.
- The packet is routed to the Envoy sidecar proxy listening on a specific port.
- Envoy (in user-space) processes the packet, gathers metrics, enforces policy, and then forwards it to the application container, again via the kernel's loopback interface.
This context switching between kernel-space and user-space for every packet adds latency and consumes CPU cycles.
With Cilium's eBPF datapath:
- A packet arrives at the node's network interface.
- An eBPF program attached to the network driver or Traffic Control (TC) hook point inspects the packet directly in kernel-space.
- The eBPF program makes the policy and forwarding decision and delivers the packet to the destination pod entirely in kernel-space, bypassing the traditional netfilter path (`iptables`).

This process eliminates multiple context switches and user-space hops, resulting in significantly lower latency and reduced CPU overhead, especially at high throughput.
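If you want to see this datapath for yourself, the eBPF programs Cilium attaches are visible with standard kernel tooling. A minimal sketch, assuming node access, `bpftool` installed, and a primary interface named `eth0` (adjust for your environment; program names vary by Cilium version):

```bash
# List eBPF programs attached to XDP and TC hooks on the node.
# Cilium's programs typically appear with cil_* / from-container style names.
sudo bpftool net show

# Inspect the TC ingress filters on the primary interface directly.
sudo tc filter show dev eth0 ingress

# Cilium pins its maps under the BPF filesystem.
ls /sys/fs/bpf/tc/globals/ 2>/dev/null | head
```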
Section 1: Deploying a Production-Ready Cilium & Hubble Stack
Deploying Cilium via Helm is straightforward, but a production deployment requires careful configuration. A default `helm install` is insufficient. Here's a breakdown of a robust `values.yaml` for enabling a full observability stack.
Prerequisites
* A Kubernetes cluster (v1.23+ recommended).
* A Linux kernel with full eBPF support (5.10+ is ideal for all features); a quick way to check is shown after this list.
* Helm v3+.
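A quick pre-install sanity check (a minimal sketch; paths are standard on most distributions, but verify on your own nodes):

```bash
# Kernel version: 5.10+ recommended for the full Cilium feature set
uname -r

# Confirm the BPF filesystem is (or can be) mounted; Cilium can mount it for you
mount | grep /sys/fs/bpf || echo "bpffs not mounted yet"

# Optional: check core eBPF kernel config flags if the config is exposed
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /boot/config-$(uname -r) 2>/dev/null
```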
Advanced Helm Configuration (`values.yaml`)
This configuration enables Hubble for UI and CLI access, exposes Prometheus metrics for scraping, and configures the operator for high availability.
```yaml
# values-production.yaml

# Use a stable, recent version of Cilium
image:
  tag: "v1.15.1"

# Configure eBPF settings for modern kernels
ebpf:
  # Mount BPF filesystem
  mountPath: /sys/fs/bpf

# Operator configuration for HA and Prometheus integration
operator:
  replicas: 2
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true # For Prometheus Operator

# Enable Hubble for deep flow visibility
hubble:
  enabled: true
  relay:
    enabled: true
    prometheus:
      serviceMonitor:
        enabled: true
  ui:
    enabled: true
  metrics:
    enabled:
      - "dns"
      - "drop"
      - "tcp"
      - "flow"
      - "port-distribution"
      - "icmp"
      - "httpV2:exemplar=true;labels=source_ip,source_namespace,source_pod,destination_ip,destination_namespace,destination_pod,traffic_direction,http_method,http_path,http_status_code"

# Enable Prometheus scraping on Cilium agents
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true

# Crucial for L7 visibility. Default is "default" which can be permissive.
# "always" ensures policies are always enforced, giving better visibility into drops.
policyEnforcementMode: "always"

# Enable L7 protocol parsing for HTTP/gRPC. This is essential for Hubble's L7 features.
# Note: This adds some overhead. We'll discuss tuning this later.
httpProtocol: true
grpcProtocol: true

# For cloud environments, set the native routing CIDR
# Example for AWS
# ipam:
#   operator:
#     clusterPoolIPv4PodCIDR: "10.0.0.0/16"

# Security context for production hardening
securityContext:
  privileged: false

# For clusters with kube-proxy disabled for maximum performance
# kubeProxyReplacement: strict
```
Deployment Steps
```bash
helm repo add cilium https://helm.cilium.io/

helm install cilium cilium/cilium --version 1.15.1 \
  -n kube-system \
  -f values-production.yaml

cilium status --wait
```
You should see all components healthy and the KubeProxyReplacement status reflecting your configuration.
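Before onboarding workloads, it's also worth running Cilium's built-in end-to-end connectivity test. This command is part of the standard `cilium` CLI; it creates (and later cleans up) test workloads in a dedicated namespace, so run it on a non-production cluster or during a maintenance window:

```bash
# Runs a suite of pod-to-pod, pod-to-service, and policy enforcement checks
cilium connectivity test
```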
Section 2: Deep Network Flow Analysis with Hubble
With Cilium running, Hubble becomes our microscope into the cluster's network traffic. We'll deploy a sample microservices application to demonstrate its power.
Sample Application
Let's use a simple three-tier app: `frontend` -> `api-service` -> `db-service`.
```yaml
# sample-app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: demo-app
  labels:
    app: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: nginx # Simple placeholder
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: demo-app
  labels:
    app: api-service
    tier: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        tier: backend
    spec:
      containers:
        - name: api-service
          image: kennethreitz/httpbin # A useful HTTP testing service
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: demo-app
spec:
  selector:
    app: api-service
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-service
  namespace: demo-app
  labels:
    app: db-service
    tier: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db-service
  template:
    metadata:
      labels:
        app: db-service
        tier: database
    spec:
      containers:
        - name: db-service
          image: redis
```
Apply this manifest: `kubectl apply -f sample-app.yaml`
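Before generating traffic, confirm the demo workloads are up:

```bash
kubectl get pods -n demo-app -o wide
```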
Advanced Hubble CLI Usage
First, port-forward the Hubble Relay service:

```bash
kubectl port-forward -n kube-system svc/hubble-relay 4245:80
```
Now, use the `hubble` CLI. Let's explore some advanced queries.

Observe all HTTP traffic in the `demo-app` namespace, including L7 details:

```bash
hubble observe --namespace demo-app --protocol http -o json | jq
```
This provides raw, detailed JSON output for every observed HTTP request, perfect for scripting or further analysis.
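The CLI also supports composable filters that are handy during an incident. A few examples (flag names reflect recent Hubble CLI releases; double-check `hubble observe --help` on your version):

```bash
# Only HTTP responses with status code 500 in the demo-app namespace
hubble observe --namespace demo-app --protocol http --http-status 500

# Traffic from a specific pod to a specific backend pod
hubble observe --from-pod demo-app/frontend --to-pod demo-app/api-service

# Recent DNS lookups, useful for spotting misconfigured service names
hubble observe --namespace demo-app --protocol dns --last 20
```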
Let's apply a restrictive network policy that only allows `frontend` to talk to `api-service`.
```yaml
# netpol.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "api-allow-frontend"
  namespace: demo-app
spec:
  endpointSelector:
    matchLabels:
      app: api-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/get"
```
Apply it: `kubectl apply -f netpol.yaml`
Now, try to `POST` from `frontend` to `api-service`:
```bash
# Exec into the frontend pod
kubectl exec -it -n demo-app $(kubectl get pods -n demo-app -l app=frontend -o name) -- bash

# Inside the pod, make an allowed request
curl http://api-service/get

# Make a disallowed request
curl -X POST http://api-service/post
```
The second `curl` will be rejected: because this is an L7 rule enforced by Cilium's proxy, the denied request typically receives an HTTP 403 (Access Denied) response rather than simply hanging.
Now, let's debug this with Hubble:

```bash
hubble observe --namespace demo-app --verdict DROPPED
```

Expected Output:

```
Feb 26 15:30:10.123 [kube-system/cilium-abcd1] demo-app/frontend-xyz -> demo-app/api-service-qrs:80 http-post DROPPED (Policy denied)
```

This instantly tells you the source, destination, protocol, and exactly why the packet was dropped: `Policy denied`. This is invaluable for debugging complex network policies.
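During an active incident you can also follow drops in real time and narrow them to the suspect pod pair (again a sketch; verify flag names against your Hubble CLI version):

```bash
# Stream policy drops between frontend and api-service as they happen
hubble observe --verdict DROPPED \
  --from-pod demo-app/frontend \
  --to-pod demo-app/api-service \
  --follow
```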
Using the Hubble UI for Visual Analysis
Port-forward to the Hubble UI service:

```bash
cilium hubble ui
```
This will open a browser window showing a live service map of your cluster. Here's what to look for:
* Service Map: You'll see nodes for `frontend`, `api-service`, and `db-service`. Allowed traffic (the `GET` request) will show as a solid green line. The denied `POST` request will appear as a flashing red line, making it immediately obvious where connectivity issues lie.
* Flow Inspection: Clicking on the line between `frontend` and `api-service` will show you a list of recent flows. You can filter for `DROPPED` verdicts and inspect the L7 details of the denied request, including the HTTP method (`POST`) and path (`/post`) that violated the policy.
Section 3: Production Metrics with Prometheus and Grafana
While Hubble is excellent for real-time debugging, Prometheus is the tool for long-term monitoring, alerting, and trend analysis. Cilium exposes a rich set of metrics.
Key Metrics to Monitor
Assuming you have a Prometheus Operator stack, the `ServiceMonitor` objects we enabled will automatically start scraping metrics. Here are the most critical ones:

* `cilium_drop_count_total{reason="Policy denied"}`: The absolute number one metric for network policy health. A sudden increase indicates misconfiguration or an unauthorized access attempt. You should have an alert on this (an example rule follows this list).
* `cilium_forward_count_total`: Packet counter for all forwarded traffic (its companion `cilium_forward_bytes_total` tracks byte volume). Useful for understanding traffic volume between services.
* `hubble_flows_processed_total`: Monitors the health of Hubble itself. A drop in its rate could indicate an issue with the observability pipeline.
* `hubble_http_requests_total`: L7 metric giving you HTTP request counts, filterable by method, path, and status code. This provides a RED (Rate, Errors, Duration) metrics baseline without instrumenting your application.
* `cilium_endpoint_regeneration_time_stats_seconds`: Tracks how long it takes Cilium to regenerate an endpoint's eBPF program when its identity or policy changes. Spikes can indicate an overloaded operator or control plane issues.
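As referenced in the first bullet above, policy-denied drops deserve a standing alert. A minimal sketch of a Prometheus Operator `PrometheusRule`; the rule name, namespace, threshold, and duration are placeholders to adapt to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cilium-policy-drops
  namespace: monitoring
spec:
  groups:
    - name: cilium.rules
      rules:
        - alert: CiliumPolicyDeniedSpike
          # Fires when policy-denied drops exceed 1 packet/s for 10 minutes
          expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Cilium is dropping packets due to network policy"
            description: "Policy-denied drop rate is {{ $value }} packets/s. Check recent policy changes or investigate unexpected traffic."
```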
Building a Production Grafana Dashboard
Here are some PromQL queries for a powerful Cilium dashboard:
```
topk(10, sum(rate(cilium_drop_count_total[5m])) by (reason, direction, source_namespace, destination_namespace))
```
This gives you an immediate overview of why packets are being dropped across the cluster.
```
sum(rate(hubble_http_requests_total{http_status_code=~"5.."}[5m])) by (source_pod, destination_pod)
  /
sum(rate(hubble_http_requests_total[5m])) by (source_pod, destination_pod)
```
This query leverages Hubble's L7 metrics to calculate a crucial application-level SLO, the HTTP 5xx error ratio per service pair, without any application-side instrumentation.
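To complete the RED picture with duration, Hubble's HTTP metrics also expose a request-duration histogram. A sketch of a p95 latency query; the metric name assumes the `httpV2` metric is enabled as in the Helm values above, so confirm it appears in your Prometheus before building dashboards on it:

```
histogram_quantile(
  0.95,
  sum(rate(hubble_http_request_duration_seconds_bucket[5m])) by (le, destination_pod)
)
```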
```
sum(rate(cilium_policy_enforcement_total[5m])) by (namespace, direction)
```
This shows how actively your network policies are being evaluated, which can be useful for identifying unexpectedly permissive or restrictive rules.
Section 4: Edge Cases & Performance Tuning
This is where senior-level operational knowledge comes into play. Running Cilium in production requires understanding its boundaries and how to tune it.
Edge Case 1: Handling Non-HTTP L7 Protocols (e.g., Kafka, Redis)
Cilium's L7 parsing is currently limited to HTTP, gRPC, and a few other known protocols. What about our `db-service` running Redis?
* Problem: You cannot write a `CiliumNetworkPolicy` that says "allow `api-service` to run the `SET` command but not `FLUSHALL`."
* Solution: You must fall back to L3/L4 policies. You can restrict access to the Redis pod on its port (6379) from the `api-service` pod. This is less granular but still provides essential network segmentation.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "db-allow-api-service"
namespace: demo-app
spec:
endpointSelector:
matchLabels:
app: db-service
ingress:
- fromEndpoints:
- matchLabels:
app: api-service
toPorts:
- ports:
- port: "6379"
protocol: TCP
Edge Case 2: The Overhead of L7 Visibility
Enabling HTTP parsing on every pod (`httpProtocol: true`) is great for observability but has a performance cost: traffic that requires L7 parsing is redirected through Cilium's per-node proxy, and more flow data has to be shipped from the kernel to the user-space Hubble agent.
* Problem: In a high-throughput cluster, full L7 parsing everywhere can consume non-trivial CPU.
* Solution: Use policy-based L7 visibility. Disable global parsing and enable it only where needed via annotations or policy.
In your `values.yaml`:

```yaml
# Disable global parsing
httpProtocol: false
```
Then, in a `CiliumNetworkPolicy`, specify which protocols to parse for that specific traffic:
```yaml
...
  toPorts:
    - ports:
        - port: "80"
          protocol: TCP
      rules:
        http: # This enables HTTP parsing for this specific rule
          - {}
```
This targeted approach gives you L7 visibility on critical application endpoints while keeping the overhead minimal for high-volume, internal traffic (e.g., database connections).
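Putting it together, here is a sketch of a complete visibility-oriented policy for the demo app: it allows all HTTP from `frontend` to `api-service` (no method or path restriction) while still forcing L7 parsing, so Hubble records method, path, and status for these flows. The policy name is illustrative; adapt the selectors to your workloads.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "api-l7-visibility"
  namespace: demo-app
spec:
  endpointSelector:
    matchLabels:
      app: api-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              # An empty rule allows all HTTP requests while still enabling L7 parsing
              - {}
```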
Performance Tuning: eBPF Map Sizing
eBPF programs use maps to store state (e.g., connection tracking entries, policy identities). In very large clusters, the default map sizes might be insufficient.
* Symptom: You might see `BPF map is full` errors in the `cilium-agent` logs, leading to dropped connections or incorrect policy enforcement.
* Tuning Parameters (via Helm `values.yaml` or agent arguments):
  * `bpf.ctGlobalTcpMax`: Controls the size of the TCP connection tracking map. Default is ~1 million. Increase this for nodes handling a massive number of concurrent TCP connections.
  * `bpf.natMapEntries`: Size of the NAT map. Important for nodes running `NodePort` services or using egress gateways.
  * `bpf.policyMapEntries`: Size of the policy map. In clusters with thousands of services and complex policies, this may need to be increased.
Example Adjustment:

```yaml
# In values.yaml
bpf:
  ctGlobalTcpMax: 2000000
  natMapEntries: 131072
```
Tuning these requires careful monitoring of map pressure via the `cilium bpf ct list` and `cilium bpf nat list` commands on the node.
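A rough way to gauge map pressure from the agent itself (a sketch: it simply counts entries, which you can compare against the configured map sizes; recent Cilium versions also expose a map-pressure metric worth alerting on):

```bash
# Pick one Cilium agent pod
AGENT=$(kubectl get pods -n kube-system -l k8s-app=cilium -o name | head -n1)

# Count connection-tracking entries on that node
kubectl exec -n kube-system $AGENT -c cilium-agent -- cilium bpf ct list global | wc -l

# Count NAT entries on that node
kubectl exec -n kube-system $AGENT -c cilium-agent -- cilium bpf nat list | wc -l
```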
Conclusion: A New Paradigm for Cloud-Native Observability
Transitioning from sidecar-based service meshes to an eBPF-powered datapath with Cilium is more than just a performance optimization; it's a paradigm shift in how we approach cloud-native observability. By moving visibility to the kernel, we gain a more efficient, secure, and comprehensive view of our systems.
We've covered the practical steps for a production deployment, moving beyond simple installation to advanced configuration, deep debugging with the Hubble CLI, and integrating with Prometheus for long-term monitoring. More importantly, we've addressed the critical edge cases and performance tuning considerations that separate a proof-of-concept from a production-ready system.
As eBPF continues to mature, we can expect even more capabilities to be pushed into the kernel, from advanced security monitoring with tools like Tetragon to even more efficient load balancing. For senior engineers responsible for the stability, performance, and security of large-scale Kubernetes clusters, mastering eBPF-based tools like Cilium is no longer a niche skill—it's becoming a fundamental component of the modern cloud-native stack.