eBPF-Powered Network Policies in Cilium for K8s Clusters

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Iptables Bottleneck: Why Native K8s Networking Falters at Scale

For any seasoned engineer operating Kubernetes in production, kube-proxy in iptables mode is synonymous with a well-known performance ceiling. While functional for small clusters, its design shows significant strain as the number of Services and Pods scales into the thousands. The core issue lies in the Linux kernel's netfilter framework and its user-space configuration utility, iptables.

In a typical iptables-based cluster, kube-proxy creates a set of rules for every Service. When a packet destined for a Service's ClusterIP arrives at a node, it must traverse a series of iptables chains (e.g., KUBE-SERVICES, KUBE-SVC-*, KUBE-SEP-*). This traversal is fundamentally a linear search: the kernel iterates through the rules in a chain until a match is found. With 10 Services, this is negligible. With 10,000 Services, the latency introduced by this O(n) complexity becomes a critical performance bottleneck, impacting service-to-service communication latency.
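To make the complexity difference concrete, here is a toy model in plain Python (not kernel code; the Service IPs and rule counts are purely illustrative) contrasting an iptables-style linear chain scan with the hash-based lookup that eBPF maps provide:

```python
# Toy model: iptables-style linear scan vs. eBPF-map-style hash lookup.
# Service IPs and backend names below are illustrative only.

def chain_lookup(rules, cluster_ip):
    """iptables-style: scan rules in order until one matches -- O(n)."""
    for rule_ip, backend in rules:
        if rule_ip == cluster_ip:
            return backend
    return None

def map_lookup(table, cluster_ip):
    """eBPF-map-style: a single hash lookup -- O(1) on average."""
    return table.get(cluster_ip)

# 10,000 synthetic Services
rules = [(f"10.96.{i // 256}.{i % 256}", f"backend-{i}") for i in range(10_000)]
table = dict(rules)

# Both find the same backend; only the cost per lookup differs.
target = "10.96.39.15"  # corresponds to i = 9999
assert chain_lookup(rules, target) == map_lookup(table, target) == "backend-9999"
```

The worst-case Service (last in the chain) forces the linear version to inspect all 10,000 rules, while the map version performs one lookup regardless of cluster size.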

Consider a large cluster. An iptables-save command might reveal tens of thousands of lines, creating a massive, unwieldy ruleset that is difficult to debug and slow to update. Every time a Pod or Service is created or destroyed, kube-proxy must synchronize these rules across every node in the cluster, leading to update latency and potential race conditions.

Beyond raw performance, standard Kubernetes NetworkPolicy objects, while useful, are fundamentally limited to L3/L4 constructs (IP addresses and ports). They operate on a simple premise: if a Pod with label app=backend is allowed to talk to a Pod with label app=database on port 5432, the policy matches based on the Pod's IP address. This model lacks the granularity required for modern microservice architectures:

  • No L7 Awareness: It cannot distinguish between GET /api/v1/read and DELETE /api/v1/write to the same endpoint.
  • No DNS Awareness: Egress rules are restricted to IP blocks (CIDRs). Allowing a Pod to communicate with api.thirdparty.com is challenging, as the underlying IPs can change dynamically.
  • Identity is IP-based: Security is tied to an ephemeral IP address, not a cryptographic or logical workload identity.

This is where the paradigm shifts. Cilium, powered by eBPF, bypasses iptables entirely for in-cluster networking, addressing these limitations at the kernel level.

    eBPF and Cilium: A Kernel-Level Revolution for Cloud-Native Networking

    eBPF (extended Berkeley Packet Filter) is a revolutionary kernel technology that allows sandboxed programs to run directly within the Linux kernel without changing kernel source code or loading kernel modules. Think of it as event-driven, kernel-space JavaScript. You can attach eBPF programs to various kernel hooks, such as network interface events, system calls, and tracepoints.

    Cilium leverages this capability by attaching eBPF programs to network hooks at the earliest possible point, primarily the Traffic Control (TC) ingress/egress hooks on network devices (like veth pairs for Pods). When a packet enters or leaves a Pod's network namespace, it triggers Cilium's eBPF program.

    Here’s a simplified breakdown of the eBPF-based path versus the iptables path:

    Iptables Data Path:

    Packet -> NIC -> TC Layer -> Netfilter (PREROUTING -> FORWARD/INPUT -> POSTROUTING) -> TC Layer -> Egress

    Within Netfilter, the packet traverses numerous chains: Mangle, NAT, Filter, etc., including the KUBE-SERVICES chain.

    Cilium eBPF Data Path:

    Packet -> NIC -> TC Ingress Hook (eBPF Program) -> Network Stack (if allowed) -> TC Egress Hook (eBPF Program) -> Egress

    Cilium's eBPF program performs several key functions directly in the kernel:

    * Service Load Balancing: Instead of traversing iptables chains, the eBPF program performs a highly efficient hash-table lookup in an eBPF map to find the correct backend Pod IP for a given Service ClusterIP. This is an O(1) operation, regardless of the number of Services.

    * Policy Enforcement: The core of Cilium's security model is its use of security identities. When a Cilium-managed Pod starts, Cilium assigns it a unique numeric identity based on its labels. This mapping (pod labels -> security identity) is shared across the cluster. The eBPF program then makes policy decisions based on these compact numeric IDs, not on ephemeral IP addresses.

    * State Management with eBPF Maps: eBPF maps are a generic key/value store in the kernel that eBPF programs can access with high efficiency. Cilium uses them extensively to store state, such as:

    * Service IP -> Backend Pod IPs

    * Security Identity -> Allowed Security Identities

    * IP -> Security Identity (for traffic from outside the cluster)

    This architectural shift from a linear rule-based system to a map-based lookup system is the source of its performance gains and advanced capabilities.

    Advanced Policies with `CiliumNetworkPolicy` CRDs

    While Cilium can enforce standard Kubernetes NetworkPolicy objects, its true power is unlocked via its own Custom Resource Definition (CRD), CiliumNetworkPolicy. This CRD extends the Kubernetes API with the advanced features we need.

    Scenario 1: Identity-Based Policies Beyond Pod Selectors

    Imagine a standard three-tier application: frontend, api-backend, and database. We want to enforce strict communication paths.

    * frontend can talk to api-backend on port 8080.

    * api-backend can talk to database on port 5432.

    * No other traffic is allowed.

    With a standard NetworkPolicy, you'd use podSelector. With CiliumNetworkPolicy, we can use more abstract identity concepts like service accounts or labels from other namespaces.

    Here is a production-grade implementation:

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: api-backend-policy
      namespace: production
    spec:
      endpointSelector:
        matchLabels:
          app: api-backend
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: frontend
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
      egress:
      - toEndpoints:
        - matchLabels:
            app: database
        toPorts:
        - ports:
          - port: "5432"
            protocol: TCP
    

    This looks similar to a standard policy, but the magic is in how Cilium enforces it. When the frontend pod tries to connect to api-backend, the eBPF program on the source node does the following:

  • Looks up the security identity of the frontend pod.
  • Looks up the destination IP, resolves it to the api-backend pod, and finds its identity.
  • Consults an eBPF map containing the allowed identity pairs.
  • If the pair (identity_frontend, identity_api_backend) is present for port 8080, the packet is forwarded. Otherwise, it's dropped.

    This is all done in the kernel, at near line-rate speed.
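The lookup sequence above can be sketched in plain Python (a conceptual model only, with made-up identity numbers and IPs, not Cilium's actual eBPF code):

```python
# Conceptual model of identity-based policy enforcement.
# Identity numbers and pod IPs are illustrative.

ip_to_identity = {
    "10.0.1.10": 51234,  # frontend pod
    "10.0.1.23": 67890,  # api-backend pod
}

# (src_identity, dst_identity, dst_port) tuples derived from the policy.
allowed = {(51234, 67890, 8080)}

def verdict(src_ip, dst_ip, dport):
    """Return FORWARD if the identity pair is allowed on this port, else DROP."""
    src_id = ip_to_identity.get(src_ip)
    dst_id = ip_to_identity.get(dst_ip)
    if src_id is None or dst_id is None:
        return "DROP"  # unknown endpoint: no identity, no policy match
    return "FORWARD" if (src_id, dst_id, dport) in allowed else "DROP"

assert verdict("10.0.1.10", "10.0.1.23", 8080) == "FORWARD"
assert verdict("10.0.1.10", "10.0.1.23", 9090) == "DROP"
```

Note that the decision never compares raw IPs against policy rules; IPs are only used once, to resolve the compact numeric identity.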

    Scenario 2: DNS-Aware Egress Policies for External Services

    This is a classic problem. A payment processing service needs to communicate with api.stripe.com. You cannot create a NetworkPolicy with an egress rule for a domain name. The traditional, insecure approach is to allow egress to 0.0.0.0/0 or a wide IP range, which violates the principle of least privilege.

    CiliumNetworkPolicy solves this with toFQDNs.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: payment-service-egress
      namespace: production
    spec:
      endpointSelector:
        matchLabels:
          app: payment-service
      egress:
      - toFQDNs:
        - matchName: "api.stripe.com"
        toPorts:
        - ports:
          - port: "443"
            protocol: TCP
      # Also allow DNS traffic itself!
      - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
        toPorts:
        - ports:
          - port: "53"
            protocol: UDP
          rules:
            dns:
            - matchPattern: "*.stripe.com"

    How this works under the hood:

  • The cilium-agent on the node sees this policy.
  • It begins monitoring DNS responses for api.stripe.com that originate from pods matching the endpointSelector.
  • When the payment-service pod performs a DNS lookup for api.stripe.com, the cilium-agent (or its eBPF program) intercepts the response.
  • It extracts the resolved IP addresses (e.g., 52.1.2.3, 54.3.2.1) from the DNS A/AAAA records.
  • It programs the eBPF policy map on the node, dynamically adding a rule to allow egress from the payment-service pod's identity to these specific destination IPs on port 443.
  • The agent respects the DNS record's TTL. When the TTL expires, it removes the IPs from the allowlist, forcing a fresh DNS lookup on the next connection attempt. This ensures the policy adapts to changing external IPs.
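A simplified model of that TTL-driven lifecycle, in plain Python (the real implementation lives in the cilium-agent and eBPF maps; class and method names here are invented for illustration):

```python
import time

class FqdnAllowlist:
    """Toy model of DNS-aware egress: allow resolved IPs until their TTL expires."""

    def __init__(self):
        self._entries = {}  # ip -> expiry timestamp (seconds)

    def observe_dns(self, ips, ttl, now=None):
        """Called when a DNS response for an allowed FQDN is intercepted."""
        now = time.time() if now is None else now
        for ip in ips:
            self._entries[ip] = now + ttl

    def is_allowed(self, ip, now=None):
        """Egress check: only IPs with an unexpired DNS record pass."""
        now = time.time() if now is None else now
        expiry = self._entries.get(ip)
        if expiry is None or now > expiry:
            self._entries.pop(ip, None)  # expired: force a fresh DNS lookup
            return False
        return True

acl = FqdnAllowlist()
acl.observe_dns(["52.1.2.3", "54.3.2.1"], ttl=60, now=1000)
assert acl.is_allowed("52.1.2.3", now=1030)      # within TTL
assert not acl.is_allowed("52.1.2.3", now=1061)  # TTL expired
assert not acl.is_allowed("8.8.8.8", now=1030)   # never resolved
```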

    This is a game-changer for securing egress traffic. The policy for allowing DNS itself is also critical; here we restrict DNS lookups to only the *.stripe.com pattern, providing another layer of security.

    Scenario 3: L7 Protocol-Aware Policies (HTTP)

    Let's take our security a step further. We have a metrics-scraper service that should only be allowed to access the /metrics endpoint on our microservices, and nothing else. An L4 policy would have to allow access to the entire service port.

    Cilium can enforce L7 policies by transparently integrating a proxy (like Envoy) into the data path when required.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: api-metrics-policy
      namespace: production
    spec:
      endpointSelector:
        matchLabels:
          app: api-backend
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: metrics-scraper
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/metrics"

    The L7 Data Path:

  • A packet arrives from metrics-scraper destined for api-backend on port 8080.
  • The eBPF program at the TC hook on the api-backend's veth interface matches the L3/L4 identity rule.
  • Because the policy contains an L7 http rule, the eBPF program knows not to forward the packet directly to the application socket.
  • Instead, it transparently redirects the packet to a listener on the Envoy proxy running within the cilium-agent pod on that same node.
  • Envoy inspects the HTTP request. If it's a GET request for the /metrics path, it forwards the request to the application pod.
  • If the request is, for example, POST /admin/delete, Envoy will reject it with a 403 Forbidden response.

    Performance Consideration: This redirection from kernel (eBPF) to user space (Envoy) and back introduces latency compared to pure L4 eBPF forwarding. This is an unavoidable trade-off for L7 visibility. The key is to apply L7 policies surgically, only to the endpoints that absolutely require them, while using high-performance L4 policies for all other traffic.
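The proxy-side decision reduces to a simple rule match. The sketch below models the semantics of the policy above in plain Python (not Envoy's implementation; note that Cilium actually treats the path field as an anchored regular expression, which exact string comparison approximates here):

```python
# Sketch of L7 HTTP rule matching as applied by the embedded proxy.
# The rule set mirrors the api-metrics-policy example.
rules = [{"method": "GET", "path": "/metrics"}]

def l7_verdict(method, path):
    """Return the HTTP status the proxy would produce for this request."""
    for rule in rules:
        if rule["method"] == method and rule["path"] == path:
            return 200  # forwarded to the application pod
    return 403          # rejected by the proxy: 403 Forbidden

assert l7_verdict("GET", "/metrics") == 200
assert l7_verdict("POST", "/admin/delete") == 403
```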

    Cilium supports other L7 protocols, such as Kafka. You could write a policy allowing a service to produce to the orders topic but not consume from the payments topic.
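For instance, the produce-only rule described above could look roughly like the following (a sketch: the app labels, namespace, and broker port are assumptions; verify the kafka rule fields against your Cilium version's CRD reference):

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: orders-producer-policy   # hypothetical name
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: order-service          # hypothetical producer workload
  egress:
  - toEndpoints:
    - matchLabels:
        app: kafka                # hypothetical broker label
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - role: "produce"
          topic: "orders"         # may produce to orders, nothing else
```

With no matching rule for the payments topic, consume (or produce) requests against it would be rejected at the protocol level.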

    Production Patterns and Observability with Hubble

    Defining policies is only half the battle. In a complex system, you need to understand, observe, and debug traffic flows.

    Pattern 1: Cluster-Wide Default-Deny

    For a truly secure posture, a best practice is to start with a default-deny policy for the entire cluster. This forces development teams to explicitly define what communication their applications need. This can be achieved with a CiliumClusterwideNetworkPolicy.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumClusterwideNetworkPolicy
    metadata:
      name: cluster-default-deny
    spec:
      endpointSelector: {}
      ingress:
      - fromEndpoints:
        - matchLabels:
            # This label will never match any pod, so no ingress is allowed.
            non-existent-label: ""

    This policy selects all pods (endpointSelector: {}) and defines an ingress rule that will never match, effectively blocking all pod-to-pod communication by default. You would then layer more specific, permissive policies on top of this.

    Pattern 2: Debugging with Hubble

    Hubble is Cilium's observability component. It leverages the same eBPF data source to provide deep insights into network traffic and policy decisions without any instrumentation.

    Scenario: A developer deploys a new user-service and reports it cannot connect to the auth-database. The connection is timing out.

    As a platform engineer, your first step is to use the hubble CLI:

    bash
    # Watch for traffic from the user-service pod, specifically looking for drops.
    hubble observe --from-pod production/user-service-7f... --verdict DROPPED
    
    # Sample Output:
    # TIMESTAMP            SOURCE                                  DESTINATION                             TYPE      VERDICT   REASON
    # Apr 23 10:30:15.123  production/user-service-7f... -> production/auth-database-5c... (10.0.1.23:5432)  L4_DROP   DROPPED   Policy denied

    The output immediately tells you the packet was dropped due to a policy (Policy denied). You've confirmed the root cause in seconds. It's not a DNS issue, a firewall rule, or an application crash; it's a missing network policy.

    To proactively debug before deploying, you can use cilium policy trace. This tool simulates how the policy engine would treat a packet without actually sending it.

    bash
    # Simulate a packet from a pod with label app=user-service to a pod with label app=auth-database on port 5432
    cilium policy trace --src-labels app=user-service --dst-labels app=auth-database --dport 5432
    
    # Sample Output:
    # -> Verdict: DENIED
    # -> Policy Trace: No matching policy found

    This powerful tool allows you to validate your policies and troubleshoot connectivity issues declaratively, which is essential in a GitOps workflow.

    Advanced Edge Cases and Performance

    Host Networking and Node Security

    Pods running with hostNetwork: true are a security challenge because they bypass the pod network namespace and bind directly to the node's network interface. By default, Cilium policies targeting pods do not apply to them. To secure these workloads and the nodes themselves, you can enable Cilium's host firewall and write CiliumClusterwideNetworkPolicy objects with a nodeSelector. These host policies apply Cilium's identity-based security to the host network stack, letting you control traffic to/from the node itself, or from hostNetwork pods.

    Transparent Encryption with WireGuard

    Cilium can provide transparent pod-to-pod and node-to-node encryption using WireGuard. Because policy enforcement happens in the eBPF layer before packets are passed to the WireGuard stack for encryption, you get the best of both worlds: full observability and policy control over unencrypted traffic, and secure transport over the underlying network.

    CPU/Memory Overhead

    While the eBPF data path is highly efficient, the cilium-agent DaemonSet does consume resources. Its consumption depends on the features enabled. L3/L4 policy enforcement has very low overhead. Enabling L7 policies will increase CPU and memory usage due to the embedded Envoy proxy. Monitoring the resource consumption of the cilium-agent is crucial. For very large nodes, you may need to adjust the CPU/memory limits and tune parameters like eBPF map sizes.
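As an illustration only, map sizing and agent resources are typically tuned through Helm values along these lines (the keys below are assumptions that vary across Cilium releases; check your chart version's reference before using them):

```yaml
# Hypothetical values.yaml fragment -- verify keys against your Cilium chart.
bpf:
  # Size eBPF maps as a fraction of total system memory instead of fixed caps.
  mapDynamicSizeRatio: 0.0025
  # Or set explicit upper bounds for large clusters:
  policyMapMax: 65536
  lbMapMax: 131072
resources:
  limits:
    cpu: "2"
    memory: 2Gi
```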

    Conclusion

    Moving from an iptables-based CNI to a Cilium and eBPF-powered one is more than a performance upgrade; it's a fundamental shift in how we approach cloud-native security and observability. By leveraging direct kernel programmability, we can move beyond IP-based rules to a more robust, identity-aware security model. Advanced features like DNS-aware egress and L7 protocol filtering, which were once complex to implement with sidecar proxies and service meshes, can now be enforced efficiently at the CNI layer.

    For senior engineers building and securing distributed systems on Kubernetes, mastering these advanced policy constructs and observability tools is no longer a niche skill. It is a necessary step to building scalable, performant, and truly zero-trust environments.
