eBPF for Granular Network Policy in Multi-Cluster Kubernetes


The Inherent Scaling Problem: Beyond `iptables` in Distributed Systems

In any non-trivial Kubernetes deployment, the default NetworkPolicy resource, while functional, reveals its architectural limitations. Its reliance on the host's iptables or IPVS implementation for enforcement creates a scalability bottleneck that becomes untenable in large, dynamic environments. Senior engineers who have managed clusters with thousands of pods and hundreds of policies have invariably encountered the performance degradation caused by the linear traversal of massive iptables chains. For every packet, the kernel must evaluate a potentially long list of rules, leading to increased latency and CPU overhead.

This problem is compounded in a multi-cluster architecture, where key challenges emerge that iptables-based solutions are ill-equipped to handle:

  • IP Address Management (IPAM) Conflicts: Overlapping Pod CIDR ranges across clusters make IP-based policies ambiguous and unreliable. A policy allowing traffic from 10.0.1.5 is meaningless if that IP exists in multiple clusters.
  • Ephemeral Identity: Pod IPs are transient. Tying security policy to an IP address that can change on every deployment is a fragile anti-pattern.
  • Lack of L7 Awareness: Native Kubernetes NetworkPolicy is limited to L3/L4. Enforcing rules like "allow service-A to call GET /api/v1/data on service-B but not POST /api/v1/admin" requires a service mesh, which introduces its own complexity and overhead.
  • Policy Synchronization: Manually keeping policies consistent across a fleet of clusters is error-prone and operationally burdensome.
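
The first two problems are easy to see in a native policy. The hypothetical manifest below pins ingress to an IP range; the name, namespace, and CIDR are illustrative. The rule is ambiguous the moment 10.0.1.0/24 exists in more than one cluster, and it silently breaks when pods reschedule onto new IPs.

```yaml
# Hypothetical native NetworkPolicy illustrating the IP-coupled model.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-known-ips
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: billing-api
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Which cluster's 10.0.1.0/24? With overlapping Pod CIDRs, nobody knows.
    - ipBlock:
        cidr: 10.0.1.0/24
    ports:
    - protocol: TCP
      port: 8080
```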

To overcome these limitations, we must shift from an IP-based security model to an identity-based one, implemented at a more fundamental layer of the stack. This is where eBPF (extended Berkeley Packet Filter) provides a revolutionary approach.

    eBPF and Cilium: A Kernel-Level Paradigm Shift for Cloud-Native Networking

    eBPF allows us to run sandboxed programs directly within the Linux kernel, triggered by various events, including network packet reception. This capability enables us to bypass the cumbersome iptables chains and implement networking logic with the performance of compiled code operating directly on packet data.

Cilium is a CNI (Container Network Interface) plugin that leverages eBPF to provide networking, observability, and security. Instead of managing complex iptables rules, Cilium attaches eBPF programs to network interfaces (specifically at the traffic control (tc) hook). When a packet arrives, the eBPF program executes and makes an immediate policy decision.

    Stateful information, such as the mapping between a pod's IP address and its security identity, is stored in highly efficient eBPF maps (kernel-level hash maps or arrays). A pod's identity in Cilium is not its IP address but a numeric security identifier derived from its Kubernetes labels (e.g., app=frontend,env=prod).
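
To make the identity concept concrete, Cilium materializes each distinct label set as a CiliumIdentity object whose name is the numeric identifier. A sketch of what one looks like, with all values illustrative:

```yaml
# Hypothetical CiliumIdentity (values illustrative). Every pod carrying this
# exact label set shares identity 51962, regardless of its IP address.
apiVersion: cilium.io/v2
kind: CiliumIdentity
metadata:
  name: "51962"
security-labels:
  k8s:app: frontend
  k8s:env: prod
  k8s:io.kubernetes.pod.namespace: production
```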

    When pod-A attempts to communicate with pod-B, the sequence of events is as follows:

  • The eBPF program on pod-A's veth pair intercepts the outgoing packet.
  • It looks up pod-A's security identity in a local eBPF map.
  • It extracts the destination IP and looks up pod-B's security identity.
  • It consults another eBPF map containing the allowed identity pairs compiled from CiliumNetworkPolicy resources.
  • If the policy allows identity(A) -> identity(B), the packet is forwarded; otherwise, it is dropped.

This entire process occurs in the kernel, without context switching or traversing iptables chains, resulting in a dramatic performance improvement.
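
Expressed as policy, the allowed pair identity(A) -> identity(B) is nothing more than a label-to-label rule. A minimal L3/L4 sketch, with hypothetical app labels, before we add L7 rules in the next section:

```yaml
# Minimal sketch: allow identity(pod-a) -> identity(pod-b).
# Note that no IP address appears anywhere in the rule.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-a-to-b
spec:
  endpointSelector:
    matchLabels:
      app: pod-b
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: pod-a
```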

    Advanced Policy with `CiliumNetworkPolicy`

    Cilium extends the native NetworkPolicy with its own CRD, CiliumNetworkPolicy, which unlocks L7 capabilities. Consider a scenario where a billing-api service must allow a frontend service to read user data but restrict access to a sensitive payment endpoint, while also allowing a batch-processor to call a specific internal endpoint.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "billing-api-policy"
  namespace: "production"
spec:
  endpointSelector:
    matchLabels:
      app: billing-api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/users/.*"
  - fromEndpoints:
    - matchLabels:
        app: batch-processor
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/api/internal/process-batch"
```

    Here, the policy is enforced not just on IP and port but on the HTTP method and path. Cilium's eBPF programs, combined with an embedded Envoy proxy for L7 parsing, handle this directly on the node where the billing-api pod is running.

    The Multi-Cluster Challenge: Synchronizing Identity and Policy with Cluster Mesh

    Extending this identity-based model across clusters is the primary challenge. Cilium solves this with its Cluster Mesh architecture. It creates a federated control plane that synchronizes identities and service information across all connected clusters.

    Key components of Cluster Mesh:

    * clustermesh-apiserver: A dedicated API server in each cluster that exposes identity and service information to other clusters in the mesh.

* Shared root of trust: Connections between a Cilium agent in cluster-a and the clustermesh-apiserver in cluster-b are mutually authenticated with TLS certificates issued from a certificate authority common to the mesh. For workload-level mutual authentication, Cilium builds on SPIFFE (Secure Production Identity Framework for Everyone), which issues cryptographic identities (SPIFFE Verifiable Identity Documents, or SVIDs).

* Global Services: By annotating a Kubernetes Service with io.cilium/global-service: "true", Cilium makes it discoverable across the entire mesh. Connections to the service are load-balanced across healthy endpoints in every cluster where it runs, enabling transparent cross-cluster load balancing and failover.
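
A minimal global Service might look like the following sketch; the name, namespace, and port are hypothetical, and the annotation is the part that matters. The Service must be deployed with the same name and namespace in each participating cluster.

```yaml
# Hypothetical Service shared across the mesh via the global-service annotation.
apiVersion: v1
kind: Service
metadata:
  name: billing-api
  namespace: production
  annotations:
    io.cilium/global-service: "true"
spec:
  selector:
    app: billing-api
  ports:
  - port: 8080
    protocol: TCP
```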

    When a pod in cluster-a tries to connect to a global service backed by pods in cluster-b, the Cilium agent in cluster-a already has the security identities of the pods in cluster-b, synchronized via the mesh. The eBPF policy enforcement logic remains the same, but its scope is now global. The IP address of the destination pod is irrelevant; only its security identity matters.
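
Policy can also be scoped to a particular cluster. Cilium attaches a reserved io.cilium.k8s.policy.cluster label to every identity, so a rule can require that a peer live in a specific cluster. A sketch, assuming the remote cluster joined the mesh under the name cluster-b:

```yaml
# Sketch: only frontend pods running in cluster-b may reach billing-api.
# The io.cilium.k8s.policy.cluster label is populated by Cilium itself;
# the app labels and namespace are hypothetical.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-from-cluster-b
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: billing-api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
        io.cilium.k8s.policy.cluster: cluster-b
```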

    Advanced Implementation Pattern: Cross-Cluster Egress Gateway

    A common production scenario involves workloads in a modern Kubernetes environment needing to access a legacy service (e.g., a database, a third-party API) that is firewalled to a specific, static set of IP addresses. If your clusters are spread across different VPCs or regions, pods in cluster-a won't have the whitelisted IP required to access the service. We can solve this elegantly using Cilium's egress gateway feature over the Cluster Mesh.

    Scenario:

    * A data-processor workload runs in cluster-a (VPC-A).

    * A legacy PostgreSQL database is hosted outside Kubernetes and its firewall only allows connections from a specific NAT gateway in cluster-b (VPC-B).

    * We need to route traffic from the data-processor through a dedicated egress pod in cluster-b.

    Implementation:

    Step 1: Enable the Egress Gateway in cluster-b

First, we need to enable the egress gateway feature in the Cilium configuration for cluster-b and deploy a set of pods that will act as the gateways. These pods are typically deployed in a dedicated namespace, with a node selector that ensures they are scheduled on nodes with the correct external connectivity and an annotation that marks them as gateways.

```yaml
# egress-gateway-deployment-cluster-b.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: egress-gateway-b
  namespace: cilium-egress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: egress-gateway-b
  template:
    metadata:
      annotations:
        # This annotation tells Cilium this pod can be an egress gateway
        egress.cilium.io/gateway-name: egress-b
      labels:
        app: egress-gateway-b
    spec:
      # Ensure these pods land on nodes with the correct external IP
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1b
      containers:
      # This pod doesn't need to run anything; its network namespace is what matters.
      - name: unprivileged-netns
        image: registry.k8s.io/pause:3.9
```

    Step 2: Define the CiliumEgressGatewayPolicy in cluster-a

    Next, we define the policy that directs specific traffic to use this gateway. This policy is created in cluster-a, where the source workload resides.

```yaml
# egress-policy-cluster-a.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressGatewayPolicy
metadata:
  name: route-to-legacy-db
spec:
  # Select which pods this policy applies to
  selectors:
  - podSelector:
      matchLabels:
        app: data-processor

  # Define the destination CIDRs that should be routed via the gateway
  destinationCIDRs:
  - "172.18.200.10/32" # IP of the legacy PostgreSQL DB

  # Point to the gateway pods in the remote cluster
  egressGateway:
    # This must match the annotation on the gateway pods in cluster-b
    gatewayName: egress-b
    # The cluster where the gateway resides
    clusterName: "cluster-b"
```

    How it works under the hood with eBPF:

  • A data-processor pod in cluster-a makes a request to 172.18.200.10.
  • The eBPF program on its node intercepts the packet.
  • The program consults its policy maps and finds a match in the CiliumEgressGatewayPolicy.
  • Instead of routing the packet out through the node's default gateway, Cilium encapsulates the original packet in a Geneve tunnel.
  • The outer packet's destination is the IP of one of the egress-gateway-b pods in cluster-b (discovered via Cluster Mesh).
  • The packet traverses the Cluster Mesh connection (e.g., a VPN or direct VPC peering) to cluster-b.
  • The Cilium agent on the egress-gateway-b pod's node receives the encapsulated packet, decapsulates it, and performs a source NAT (SNAT) operation, rewriting the source IP to that of the egress gateway node.
  • The packet is then sent to the legacy database, appearing to originate from the whitelisted IP in VPC-B.

    This entire process is transparent to the application. It doesn't need to know anything about the complex routing; it simply connects to the database IP. This pattern provides a powerful, policy-driven way to bridge modern and legacy infrastructure securely.

    Performance Analysis and Benchmarking Considerations

    The theoretical performance benefits of eBPF are clear, but quantifying them requires a structured approach. A meaningful benchmark would compare a Cilium eBPF-based setup against a traditional iptables-based CNI (like Calico in iptables mode or kube-proxy).

    Benchmark Setup:

* Tools: netperf for latency/throughput testing, kube-burner to simulate cluster churn (creating and deleting pods and policies at a high rate). A minimal netperf pod pair is sketched after this list.

* Metrics:

  * Per-packet Latency: Measured with netperf TCP_RR (TCP Request/Response).

  * Throughput: Measured with netperf TCP_STREAM.

  * Policy Propagation Latency: Time from kubectl apply of a new NetworkPolicy to its enforcement, measured by repeatedly attempting a blocked connection.

  * CPU Utilization: On worker nodes under load, especially on the ksoftirqd kernel threads.
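
A minimal sketch of the netperf pair, assuming the publicly available networkstatic/netperf image (any image that ships netperf and netserver will do):

```yaml
# Hypothetical netperf server/client pods; the container image is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: netperf-server
spec:
  containers:
  - name: netserver
    image: networkstatic/netperf
    command: ["netserver", "-D"]   # -D keeps netserver in the foreground
---
apiVersion: v1
kind: Pod
metadata:
  name: netperf-client
spec:
  restartPolicy: Never
  containers:
  - name: netperf
    image: networkstatic/netperf
    # TCP_RR measures request/response latency; use -t TCP_STREAM for throughput.
    command: ["netperf", "-H", "$(SERVER_IP)", "-t", "TCP_RR"]
    env:
    - name: SERVER_IP
      value: "10.0.1.25"   # placeholder; substitute the server pod's IP
```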

    Expected Results:

| Metric (at 1000 policies, 5000 pods) | iptables-based CNI | Cilium (eBPF) | Improvement |
|---|---|---|---|
| P99 TCP_RR Latency (µs) | ~350 µs | ~80 µs | ~4.4x |
| TCP_STREAM Throughput (Gbps) | ~8.9 Gbps | ~9.8 Gbps | ~10% |
| Policy Propagation Latency (ms) | > 5000 ms | < 100 ms | > 50x |
| Control Plane CPU Usage (Churn) | High | Low | - |

The most significant difference is not raw throughput but latency and control plane stability. As the number of iptables rules grows, the time to update them grows super-linearly, because each change rewrites the entire ruleset; this leads to CPU spikes and long propagation delays. eBPF policy updates atomically modify entries in a hash map, a constant-time operation, which is fundamentally more scalable.

    Edge Cases and Production Debugging

    Operating a distributed system like this requires an understanding of failure modes and robust debugging tools.

    Edge Case 1: Control Plane Partition (Split Brain)

    What happens if the network link between cluster-a and cluster-b is severed? The clustermesh-apiserver instances can no longer synchronize.

* Cilium's Behavior: Cilium fails closed rather than open. Identities and endpoints learned from the remote cluster are cached locally, so existing cross-cluster connections may survive for a short time, but once that cached state expires, connections that depend on it are dropped. New connections from cluster-a to a global service in cluster-b will fail because the local agent can no longer resolve the service to endpoints in the remote cluster. This is the desired behavior; in a security context, failing to connect is preferable to allowing an unauthorized connection based on stale policy data.

    Edge Case 2: Atomic Policy Updates

    When you update a CiliumNetworkPolicy, how do you prevent a transient state where traffic is either incorrectly allowed or denied? A naive implementation might flush old rules and then add new ones, creating a window of vulnerability.

    * Cilium's Solution: Cilium uses a technique analogous to double-buffering with its eBPF maps. It prepares the new policy rules in a secondary, inactive map. Once the new map is fully populated, it uses an atomic pointer-swap operation to make it the active policy map. This ensures that the policy transition is instantaneous from the kernel's perspective, with no intermediate state.

    Advanced Debugging with the cilium CLI

    When a connection fails, you need tools to inspect the live eBPF state.

  • cilium monitor --type drop --related-to <endpoint-id>: This is the most powerful tool. It provides a real-time stream of packet drop events from the kernel, with detailed reasons, filtered to a specific endpoint ID (obtainable from cilium endpoint list). You can see exactly which policy rule caused a packet to be dropped.

```shell
# Sample output showing a drop due to a missing L7 HTTP policy
# (endpoint, identity, and address values are illustrative)
$ cilium monitor --type drop --related-to 1234
xx drop (Policy denied) flow 0x0... identity 1234->5678 to endpoint 10.0.1.25:8080
-> GET /api/v1/admin/metrics
```

    This output immediately tells you not just that a packet was dropped, but that it was dropped by the L7 policy engine because the path /api/v1/admin/metrics was not on the allowlist.

  • cilium bpf policy get <endpoint-id> -o json: This command dumps the exact policy rules loaded into the eBPF maps for a specific pod endpoint. You can see the raw security identity numbers and the corresponding allowed peers. This is invaluable for verifying that the policy you defined in YAML has been correctly compiled and loaded into the kernel.

```json
// Simplified JSON output
[
  {
    "endpoint-id": 1234,
    "ingress": [
      {
        "from-endpoints": [
          5678 // Identity of 'frontend' pods
        ],
        "to-ports": [
          {
            "port": 8080, "protocol": "TCP",
            "l7-rules": {
              "http": [
                { "method": "GET", "path": "/api/v1/users/.*" }
              ]
            }
          }
        ]
      }
    ]
  }
]
```

    Conclusion: The Future of Cloud-Native Networking is in the Kernel

    By moving network policy enforcement from the brittle and unscalable iptables model into the programmable kernel with eBPF, we can build multi-cluster Kubernetes environments that are not only faster but also more secure and observable. The identity-based approach decouples security from the underlying network topology, a critical requirement for modern, distributed applications. Advanced patterns like the cross-cluster egress gateway, once requiring complex manual routing and VPN configuration, can now be declared in a simple YAML manifest.

    For senior engineers and SREs, mastering eBPF-based tooling like Cilium is no longer a niche skill but a fundamental component of building and operating resilient, large-scale systems. The ability to program the kernel directly to solve networking and security challenges opens up a new frontier, paving the way for even more sophisticated applications like sidecar-less service meshes, fine-grained observability, and real-time security auditing, all running with near-native performance.
