Kernel-Level K8s Security: eBPF for Granular L7 Network Policies

Goh Ling Yong

The Scaling Ceiling of `iptables` in Modern Kubernetes

For years, iptables has been the bedrock of Kubernetes networking, powering both service routing via kube-proxy and network policy enforcement. Its integration with the kernel's Netfilter framework is well understood. In large-scale, dynamic microservice environments, however, this bedrock begins to crack. Engineers managing clusters with thousands of services and tens of thousands of pods know these pain points firsthand:

  • Linear Complexity (O(n)): iptables processes rules in sequential chains. As the number of services (-A KUBE-SERVICES...) or policy rules grows, the traversal time for each packet increases linearly. This introduces non-trivial latency and CPU overhead on every node.
  • Kernel Lock Contention: Updating iptables rules requires taking a lock on the entire rule set. In a highly dynamic environment with frequent pod churn, this leads to lock contention, delaying updates and impacting control plane responsiveness.
  • Lack of L7 Awareness: Standard Kubernetes NetworkPolicy is limited to L3/L4 (IP addresses and ports). Enforcing policies like "allow GET requests to /api/v1/metrics but deny POST" requires a service mesh, which introduces its own complexity, resource overhead (sidecar proxies), and latency.
  • Debugging Opacity: Tracing a packet's journey through a massive, auto-generated iptables ruleset is a notorious operational burden. Determining why a packet was dropped can involve manually parsing hundreds of rules, a process that doesn't scale.
Consider a simple packet lookup in a 10,000-service cluster. The packet hits the PREROUTING chain, jumps to the KUBE-SERVICES chain, and then must potentially traverse thousands of rules to find a match. This is repeated for every new connection.

    bash
    # A glimpse into the complexity on a node
    $ iptables-save | grep KUBE-SERVICES | wc -l
    10001
    
    # Each packet for a new connection must traverse this chain.

    This is where eBPF (extended Berkeley Packet Filter) represents a fundamental paradigm shift. Instead of routing packets through complex chains in a generic framework, eBPF allows us to attach sandboxed, event-driven programs directly to kernel hooks, processing network traffic with the efficiency of compiled code.

    This article will demonstrate how to leverage eBPF via Cilium to implement high-performance, L7-aware network policies, completely bypassing iptables and gaining unprecedented observability directly from the kernel.

    Architectural Shift: From Netfilter Chains to eBPF Hooks

    To appreciate the performance and capability gains, it's crucial to understand the architectural difference between the two models.

    The iptables / Netfilter Model:

    Packets traverse a series of well-defined hooks in the kernel's networking stack (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING). kube-proxy and CNI plugins insert chains of rules at these hooks. The kernel must walk these chains for each packet.

    Diagrammatically:

    Packet -> NIC -> Netfilter (PREROUTING -> KUBE-SERVICES chain walk) -> ...
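
    You can see this structure on any node where kube-proxy runs in iptables mode. The commands below are a quick inspection sketch (the chain names are the standard ones kube-proxy programs; rule counts will vary with cluster size):

    bash
    # List the first per-service dispatch rules a new connection may
    # have to walk through before finding its match
    sudo iptables -t nat -L KUBE-SERVICES -n | head -20

    # Jump targets lead to per-service KUBE-SVC-* and per-endpoint
    # KUBE-SEP-* chains; count how many such chains exist
    sudo iptables -t nat -L -n | grep -c '^Chain KUBE-'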

    The eBPF Model:

    eBPF programs are attached to different, often earlier, hooks: primarily the Traffic Control (TC) ingress/egress layer, or even earlier with XDP (eXpress Data Path) at the driver level.

    * TC (Traffic Control) Hook: An eBPF program attached here can see all network traffic entering or leaving a network device (physical or virtual, like a veth pair for a pod). It can make decisions—allow, drop, redirect—before the packet even enters the iptables PREROUTING stage.

    * BPF Maps: The key to eBPF's performance is its use of BPF maps. These are highly efficient key/value stores accessible from both kernel-space eBPF programs and user-space control planes. Instead of linear rule scans, eBPF programs perform O(1) hash map lookups to determine service IPs, policy rules, and connection tracking state.

    Diagrammatically:

    Packet -> NIC -> TC Hook -> eBPF Program (O(1) map lookup) -> Decision (Allow/Drop/Redirect)

    Cilium leverages this by replacing kube-proxy entirely. It watches the Kubernetes API for Services and Endpoints and populates BPF maps with this information. When a packet from a pod arrives at the TC hook on the veth interface, the attached eBPF program performs a quick lookup in a BPF map to find the destination backend pod's IP and performs DNAT directly, bypassing iptables entirely.
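
    You can inspect this state yourself. A hedged sketch follows (cilium bpf lb list is the agent CLI's view of the service map; the exact pinned map names under /sys/fs/bpf are version-dependent, so the grep is illustrative):

    bash
    # From inside a Cilium agent pod: dump the eBPF load-balancing
    # (service) map that replaces the KUBE-SERVICES chain
    cilium bpf lb list

    # Or with bpftool on a node: Cilium pins its maps in /sys/fs/bpf,
    # and their names show up in the global map listing
    bpftool map show | grep -i cilium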


    Practical Implementation: L7 Policies with Cilium

    Let's move from theory to practice. We will set up a local Kubernetes cluster using Kind and install Cilium with kube-proxy replacement enabled to unlock the full power of eBPF.

    Step 1: Environment Setup

    First, create a Kind cluster configuration that disables the default CNI and kube-proxy to allow Cilium to take over.

    kind-config.yaml

    yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    networking:
      disableDefaultCNI: true # We will install Cilium
      kubeProxyMode: "none" # We will replace kube-proxy with eBPF
    nodes:
    - role: control-plane
    - role: worker
    - role: worker

    Now, create the cluster:

    bash
    kind create cluster --config=kind-config.yaml --name=ebpf-l7-demo

    Next, install the Cilium CLI and deploy Cilium to the cluster. We'll use Helm for a configurable installation.

    bash
    helm repo add cilium https://helm.cilium.io/
    
    # With kube-proxy disabled, Cilium must reach the API server directly;
    # on a kind cluster it is reachable via the control-plane container.
    helm install cilium cilium/cilium --version 1.15.5 \
       --namespace kube-system \
       --set kubeProxyReplacement=true \
       --set k8sServiceHost=ebpf-l7-demo-control-plane \
       --set k8sServicePort=6443 \
       --set bpf.masquerade=true \
       --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
       --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
       --set cgroup.autoMount.enabled=false \
       --set cgroup.hostRoot=/sys/fs/cgroup

    Key Configuration Flags:

    * kubeProxyReplacement=true: This tells Cilium to completely handle service translation using eBPF, ensuring iptables is not used for Kubernetes Services. (Older charts used the now-deprecated value strict for the same behavior.)

    * k8sServiceHost / k8sServicePort: With kube-proxy absent, Cilium cannot rely on the in-cluster kubernetes Service VIP to reach the API server, so it must be pointed at the API server directly.

    * bpf.masquerade=true: Enables eBPF-based masquerading for traffic leaving the cluster, another task typically handled by iptables.

    Verify the installation. You should see that kube-proxy is absent and that Cilium is running in kube-system.

    bash
    cilium status --wait
    # Expected output (abbreviated):
    # ...
    # KubeProxy Replacement:   True
    # ...
    # All Cilium pods are ready
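
    Two quick sanity checks beyond cilium status (expected output is abbreviated and may vary by version):

    bash
    # kube-proxy should not exist at all
    kubectl -n kube-system get ds kube-proxy
    # Expected: Error from server (NotFound): daemonsets.apps "kube-proxy" not found

    # One Cilium agent per node should be Running
    kubectl -n kube-system get pods -l k8s-app=cilium -o wide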

    Step 2: Deploying a Sample Microservice Application

    To demonstrate L7 policies, we'll use a simple scenario: a client pod trying to access an api-server pod. The api-server exposes two endpoints: /public and /private.

    app.yaml

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: api-server
      labels:
        app: api-server
        policy-group: backend
    spec:
      containers:
      - name: server
        image: mendhak/http-https-echo
        ports:
          - containerPort: 8080 # recent http-https-echo images listen on 8080 by default
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: api-service
    spec:
      selector:
        app: api-server
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080 # match the pod's listening port
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: client
      labels:
        app: client
        policy-group: frontend
    spec:
      containers:
      - name: client
        image: appropriate/curl
        command: ["sleep", "3600"]

    Deploy the application:

    bash
    kubectl apply -f app.yaml
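
    Before testing, it helps to wait until both pods are ready:

    bash
    kubectl wait --for=condition=Ready pod/api-server pod/client --timeout=120s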

    By default, with no policies in place, the client can access both endpoints on the api-server.

    bash
    # Exec into the client pod
    kubectl exec -it client -- sh
    
    # Test access - both should succeed
    / # curl -s http://api-service/public
    / # echo $?
    0
    
    / # curl -s http://api-service/private
    / # echo $?
    0

    Step 3: Implementing an L7-Aware `CiliumNetworkPolicy`

    Now, we'll enforce a policy: allow GET requests to /public but deny everything else. This requires L7 awareness, which is impossible with standard NetworkPolicy objects. We use a CiliumNetworkPolicy CRD.

    l7-policy.yaml

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "api-l7-policy"
    spec:
      endpointSelector:
        matchLabels:
          app: api-server
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: client
        toPorts:
        - ports:
          - port: "80"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/public"

    Dissecting the Policy:

    * endpointSelector: The policy applies to pods with the label app: api-server.

    * fromEndpoints: It allows ingress traffic only from pods with the label app: client.

    * toPorts.rules.http: This is the L7 magic. For TCP traffic to the pod's port 8080, the HTTP request must use the method GET and the path /public. (The http block supports richer matchers too; see the sketch after the apply step.)

    Apply the policy:

    bash
    kubectl apply -f l7-policy.yaml
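
    The http rule block supports richer matchers than method and path, including host and header constraints. Below is a hedged, illustrative variant (not applied in this walkthrough; --dry-run=server validates it against the cluster without persisting anything):

    bash
    kubectl apply --dry-run=server -f - <<'EOF'
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "api-l7-policy-extended"
    spec:
      endpointSelector:
        matchLabels:
          app: api-server
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: client
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/public"
              headers:
              - 'X-Audit: true'   # additionally require this request header
    EOF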

    Step 4: Verifying the L7 Policy Enforcement

    Let's re-run our tests from the client pod.

    bash
    kubectl exec -it client -- sh
    
    # This should SUCCEED (matches the policy)
    / # curl -s -o /dev/null -w "%{http_code}" http://api-service/public
    200
    
    # This should FAIL (path mismatch): the L7 proxy answers 403 Forbidden
    / # curl -s -o /dev/null -w "%{http_code}" http://api-service/private
    403
    
    # This should FAIL (method mismatch)
    / # curl -X POST -s -o /dev/null -w "%{http_code}" http://api-service/public
    403

    The policy is enforced correctly. The key is how Cilium does this. When the policy is applied, Cilium's eBPF programs on the api-server's node are updated. For traffic to the pod's port 8080, instead of simply allowing it, the eBPF program transparently redirects the TCP stream to a lightweight, node-local proxy (Envoy, embedded in the Cilium agent's datapath). The proxy parses the HTTP request, makes the policy decision, and either forwards the request or answers it with 403 Forbidden. Because a single shared proxy per node handles only the ports that have L7 rules, this is far more efficient than running a full sidecar proxy next to every pod.
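
    You can confirm the L7 redirect from the agent's perspective; the output line below is a sketch of typical cilium status output (format varies by version):

    bash
    # From inside the Cilium agent pod on the api-server's node
    cilium status | grep -i proxy
    # Proxy Status:   OK, ip 10.0.0.39, 1 redirects active on ports 10000-20000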


    Kernel-Level Observability with Hubble

    One of the most powerful benefits of an eBPF-based data plane is the deep, low-overhead observability it provides. Since eBPF programs see every packet, we can capture rich metadata without any application instrumentation. Cilium's observability tool, Hubble, surfaces this data, which the agent collects directly from the kernel's eBPF datapath.

    First, enable the Hubble UI:

    bash
    cilium hubble enable --ui
    cilium hubble port-forward &

    Now, let's use the Hubble CLI to see exactly why our requests were dropped.

    bash
    # Run this command and then re-run the failed curl from the client pod
    hubble observe --from-pod default/client --to-pod default/api-server -f

    You will see real-time flow data. When the allowed request runs, you'll see an http-request flow with a FORWARDED verdict:

    text
    TIMESTAMP           SOURCE:PORT -> DESTINATION:PORT   TYPE   VERDICT     SUMMARY
    May 20 15:30:01.123 default/client:34567 -> default/api-server:80 http-request  FORWARDED   GET http://api-service/public

    When the denied request for /private runs, you'll see a clear DROPPED verdict with a reason:

    text
    TIMESTAMP           SOURCE:PORT -> DESTINATION:PORT   TYPE   VERDICT     SUMMARY
    May 20 15:30:15.456 default/client:34589 -> default/api-server:80 http-request  DROPPED     Policy denied (L7)

    This is a game-changer for debugging. Instead of guessing, or sprinkling iptables LOG rules and tailing kernel logs, you get a definitive, human-readable reason for the packet drop, including the L7 metadata that triggered the policy. You can even use hubble ui to see a graphical service map and flow animations.
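
    Hubble's CLI filters compose well for scripting and alerting; two short examples (flags per recent Hubble CLI releases):

    bash
    # Show the 20 most recent dropped flows
    hubble observe --verdict DROPPED --last 20

    # Emit JSON, one flow per line, for downstream tooling
    hubble observe --verdict DROPPED -o json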


    Performance Considerations and Benchmarking Insights

    While functionality is great, the primary driver for adopting eBPF in large clusters is performance. Let's quantify the difference.

    Theoretical Analysis:

    * iptables kube-proxy: Service lookup is O(n), where n is the number of services; in the worst case, every rule in the chain is traversed before a match is found.

    * Cilium eBPF: Service lookup is O(1). The service virtual IP is a key in a BPF hash map. The lookup is constant time, regardless of the number of services.

    Practical Impact:

    In a benchmark conducted by the Cilium community on a 30-node cluster, they compared the performance of kube-proxy in iptables mode vs. Cilium's eBPF mode as the number of services increased.

    Number of Services | iptables p99 Latency (ms) | Cilium eBPF p99 Latency (ms)
    ------------------ | ------------------------- | ----------------------------
    1,000              | ~2                        | ~0.5
    5,000              | ~8                        | ~0.5
    10,000             | ~16                       | ~0.5
    20,000             | ~30+                      | ~0.5

    Source: Cilium project benchmarks (data is illustrative of typical results)

    The results are stark. Cilium's latency remains flat and sub-millisecond, while iptables latency grows linearly and becomes a significant performance bottleneck in large clusters.
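
    For a rough feel on your own cluster, the sketch below times full HTTP requests from the client pod deployed earlier; it measures much more than the service lookup alone, so treat it as illustrative, not a benchmark:

    bash
    kubectl exec client -- sh -c \
      'for i in $(seq 1 50); do
         curl -s -o /dev/null -w "%{time_total}\n" http://api-service/public
       done | sort -n | tail -3'   # slowest 3 of 50 requests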

    Edge Case: JIT Compilation Overhead

    While per-packet processing is faster, eBPF is not without overhead. eBPF bytecode is loaded into the kernel and often Just-In-Time (JIT) compiled into native machine code for maximum performance. This JIT compilation consumes CPU cycles when a program is first loaded or updated. In environments with extremely high pod churn, this can lead to noticeable CPU spikes on nodes as Cilium constantly regenerates and loads new eBPF programs for new pod interfaces. This is a trade-off: a small, one-time cost for a massive gain in steady-state data plane performance.
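
    You can confirm the JIT is active and see the compiled programs on a node (standard kernel and bpftool interfaces; output abbreviated):

    bash
    # 1 = JIT enabled (the default on most distributions)
    sysctl net.core.bpf_jit_enable

    # List loaded eBPF programs; the "jited" sizes confirm native compilation
    bpftool prog show | head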


    Advanced Production Patterns and Troubleshooting

    For senior engineers, deploying a new technology means understanding how to debug it when things go wrong.

    Pattern: Direct Server Return (DSR)

    In standard NAT-based service routing (both iptables and eBPF), the return packet must travel back through the node that made the load balancing decision to have its source IP un-NAT'd. eBPF enables a more advanced mode called Direct Server Return (DSR). In DSR mode, the backend pod sends the reply directly to the original client, bypassing the load-balancing node entirely on the return path. This further reduces latency and removes potential bottlenecks. This is configured in Cilium via loadBalancer.mode=dsr.
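
    Enabling DSR is a Helm settings change. A hedged sketch follows; note that DSR has prerequisites (most notably native routing rather than tunneling/encapsulation in typical configurations), so verify against the Cilium docs for your version first:

    bash
    helm upgrade cilium cilium/cilium \
       --namespace kube-system \
       --reuse-values \
       --set loadBalancer.mode=dsr \
       --set routingMode=native

    # Restart the agents so the new datapath configuration takes effect
    kubectl -n kube-system rollout restart ds/cilium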

    Troubleshooting: Inspecting BPF Maps Directly

    When Hubble isn't enough and you need to go deeper, you can inspect the BPF maps that Cilium uses to store state. This is the eBPF equivalent of dumping iptables rules.

  • Find a Cilium pod: CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
  • Exec into the pod: kubectl exec -it -n kube-system $CILIUM_POD -- bash
  • List endpoints: cilium endpoint list (find the ID of your api-server pod).
  • Inspect the connection tracking table: cilium bpf ct list <endpoint-id> shows the live TCP connections being tracked by eBPF for that specific pod, including NAT state.
  • Inspect the policy map: cilium bpf policy get <endpoint-id> dumps the policy rules applied to the endpoint as they exist within the BPF map, showing allowed source identities and ports.

    This level of inspection allows you to verify, at the lowest level, exactly what rules the kernel is enforcing for a given pod, providing ultimate clarity during complex troubleshooting scenarios.
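
    For live datapath debugging, the agent can also stream events as they happen; run these from inside the Cilium agent pod, as above:

    bash
    # Stream packet-drop events, including the drop reason
    cilium monitor --type drop

    # Stream L7 (proxy) events to watch HTTP policy verdicts in real time
    cilium monitor --type l7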

    Conclusion: A New Foundation for Cloud-Native Networking

    Moving from iptables to eBPF for Kubernetes networking is more than an optimization; it's an architectural evolution. It addresses the fundamental scaling limitations of the Netfilter-based model while simultaneously unlocking capabilities like L7-aware policies and low-overhead observability that were previously the exclusive domain of heavy service meshes.

    For senior engineers responsible for the stability, performance, and security of large-scale clusters, understanding and harnessing eBPF is becoming a non-negotiable skill. By replacing kube-proxy and leveraging tools like Cilium, we can build a data plane that is not only faster and more efficient but also more secure and transparent, providing a solid foundation for the next generation of cloud-native applications.
