Granular K8s Network Policy with eBPF and Cilium's `toFQDNs`

Goh Ling Yong

The Brittle Reality of IP-Based Egress Policies

In any mature Kubernetes environment, controlling egress traffic is a foundational security requirement. The principle of least privilege dictates that a pod should only be able to communicate with the specific external services it requires. The standard Kubernetes NetworkPolicy resource provides a mechanism for this, but its reliance on IP addresses and CIDR blocks creates a significant operational burden in modern cloud architectures.

Consider a payment-processor microservice that needs to communicate with Stripe's API (api.stripe.com) and a managed PostgreSQL database on AWS RDS (prod-db.cxyz.us-east-1.rds.amazonaws.com). A naive implementation using NetworkPolicy would look like this:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-processor-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-processor
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 13.248.140.0/24 # A current IP range for api.stripe.com
    ports:
    - protocol: TCP
      port: 443
  - to:
    - ipBlock:
        cidr: 52.95.243.0/24 # A current IP range for AWS RDS
    ports:
    - protocol: TCP
      port: 5432

This approach is fundamentally flawed for two reasons:

  • IP Volatility: The IP addresses for services fronted by CDNs (like Stripe) and cloud provider endpoints (like RDS) are not static. They are subject to change without notice due to load balancing, regional failovers, or infrastructure updates. A change in the upstream IP address will instantly break this policy, causing a production outage for the payment-processor service.
  • Operational Overhead: To mitigate this, engineering teams are forced into a reactive cycle of monitoring DNS records, updating CIDR blocks in their policies, and redeploying. This is manual, error-prone, and unsustainable at scale.
This brittleness forces a choice between security and reliability. Many teams opt for overly permissive egress rules (e.g., 0.0.0.0/0), effectively abandoning the principle of least privilege and widening their attack surface.
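For illustration, that permissive fallback often looks something like the sketch below: a policy object exists, but it allows egress to any destination on any port.

yaml
# A permissive fallback policy of the kind described above -- it technically
# counts as "having a NetworkPolicy" while imposing almost no egress restriction.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-processor
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # any destination, any port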

The eBPF Advantage: Identity-Aware Policy Enforcement in the Kernel

This is where Cilium and eBPF provide a paradigm shift. Instead of relying on ephemeral network identifiers like IP addresses, Cilium derives a strong, stable security identity for each workload from its Kubernetes labels (which include namespace and ServiceAccount metadata). Policies are then written against these identities. For external services, Cilium extends this concept by treating a Fully Qualified Domain Name (FQDN) as a form of identity.

Cilium bypasses the cumbersome and inefficient iptables chains used by many CNI plugins. Instead, it attaches lightweight eBPF (extended Berkeley Packet Filter) programs directly to kernel hooks, such as the socket layer (connect(), sendmsg()) and the networking stack's Traffic Control (TC) ingress/egress points.

For DNS-aware policies, this means:

  • High Performance: Checking whether a destination IP is valid for a given FQDN is effectively a constant-time eBPF map lookup, far more efficient than traversing a linear iptables chain.
  • Dynamic Updates: eBPF maps can be updated atomically from user space without reloading rules or interrupting existing connections. This allows Cilium's agent to dynamically update the allowed IP set for an FQDN as DNS records change (a quick way to inspect these maps on a live node is sketched after this list).
  • Protocol-Level Visibility: eBPF can inspect packet data, allowing Cilium to intercept DNS requests and responses directly to build its FQDN-to-IP mapping.
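If you want to see these structures on a live cluster, the Cilium agent exposes them through its embedded CLI. A minimal sketch, assuming the agent pods carry the standard k8s-app=cilium label; in recent Cilium releases the in-pod binary is named cilium-dbg rather than cilium:

bash
# Pick one Cilium agent pod on the cluster
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
  -o jsonpath='{.items[0].metadata.name}')

# List the eBPF maps the agent manages on this node
kubectl -n kube-system exec "$CILIUM_POD" -- cilium map list

# Dump the IP-to-identity cache that FQDN policy decisions consult
kubectl -n kube-system exec "$CILIUM_POD" -- cilium map get cilium_ipcache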

Deep Dive: Implementing DNS-Aware Policies with `CiliumNetworkPolicy`

Let's solve our initial problem using a CiliumNetworkPolicy (CNP) custom resource. The key feature is the toFQDNs stanza within an egress rule.

Scenario Setup:

We have a payment-processor pod in the payments namespace. It needs to make HTTPS calls to api.stripe.com and connect to our RDS instance at prod-db.cxyz.us-east-1.rds.amazonaws.com.

First, we'll create a test pod to simulate our application:

yaml
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-processor-test
  namespace: payments
  labels:
    app: payment-processor
spec:
  containers:
  - name: test-container
    image: alpine/curl:latest
    command: ["sleep", "3600"]

Now, let's define the CiliumNetworkPolicy:

yaml
# fqdn-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-payment-processor-fqdn-egress"
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
  - toEndpoints:
    - matchLabels:
        'k8s:io.kubernetes.pod.namespace': kube-system
        'k8s:k8s-app': kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  - toFQDNs:
    - matchName: "api.stripe.com"
    - matchPattern: "*.rds.amazonaws.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      - port: "5432"
        protocol: TCP

Let's break this down:

  • endpointSelector: This targets the policy to apply to any pod with the label app: payment-processor.
  • First Egress Rule (DNS): This is a critical and often overlooked component. Before the pod can connect to an FQDN, it must be able to resolve it via DNS. This rule explicitly allows pods matching the selector to send DNS queries (UDP port 53) to the kube-dns service in the kube-system namespace. Without this, all FQDN lookups would fail, and the policy would be ineffective.
  • Second Egress Rule (toFQDNs): This is the core of our solution.
    * matchName: "api.stripe.com": This is an exact match. Only traffic to the FQDN api.stripe.com is allowed.
    * matchPattern: "*.rds.amazonaws.com": This is a wildcard match. It allows traffic to any subdomain of rds.amazonaws.com, which is useful for cloud provider services where hostnames might change but the parent domain remains constant.
    * toPorts: This restricts the allowed communication to TCP port 443 (for Stripe) and TCP port 5432 (for RDS). A variant that scopes each port to its own FQDN is sketched after this list.
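One consequence of combining both FQDNs and both ports in a single rule is that IPs resolved from api.stripe.com are also reachable on 5432, and the RDS IPs on 443. A minimal sketch of a tighter variant splits the rule so each destination only gets the port it needs; the walkthrough below continues with the combined policy, so treat this as an optional hardening step:

yaml
# Hypothetical hardened variant of fqdn-policy.yaml: one egress rule per FQDN,
# each scoped to only the port that destination actually needs.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-payment-processor-fqdn-egress"
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
  # DNS rule identical to the policy above
  - toEndpoints:
    - matchLabels:
        'k8s:io.kubernetes.pod.namespace': kube-system
        'k8s:k8s-app': kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  # Stripe is only reachable over HTTPS
  - toFQDNs:
    - matchName: "api.stripe.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
  # RDS is only reachable over PostgreSQL
  - toFQDNs:
    - matchPattern: "*.rds.amazonaws.com"
    toPorts:
    - ports:
      - port: "5432"
        protocol: TCP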

Apply these resources:

bash
kubectl create namespace payments
kubectl apply -f test-pod.yaml
kubectl apply -f fqdn-policy.yaml

Now, let's test the connectivity from our pod:

bash
# Test allowed FQDNs
kubectl -n payments exec -it payment-processor-test -- curl -v https://api.stripe.com
# This should succeed with a TLS handshake.

# Test a disallowed FQDN
kubectl -n payments exec -it payment-processor-test -- curl -v --connect-timeout 5 https://api.github.com
# This will hang and result in a connection timeout.

The curl to api.github.com fails not at the DNS level (which we allowed), but at the TCP connect() level. The pod resolves the IP for api.github.com, but when it attempts to establish a connection, the eBPF program attached by Cilium drops the packet because that destination IP is not associated with an allowed FQDN for this pod's identity.
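To confirm the failure really is at L3/L4 rather than DNS, you can resolve the blocked name from the same pod. Assuming the alpine/curl image ships BusyBox's nslookup, the lookup succeeds even though the connection does not:

bash
# DNS resolution for the blocked FQDN still works (our DNS egress rule allows it)
kubectl -n payments exec -it payment-processor-test -- nslookup api.github.com
# Only the subsequent TCP connection to the returned IP is dropped.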

Under the Hood: The DNS Proxy and eBPF Map Lifecycle

To truly understand why this is so robust, we must examine the internal mechanics of how Cilium processes DNS-aware policies. The process involves a collaboration between an eBPF program in the kernel and a DNS proxy running in user space within the Cilium agent.

Here is the step-by-step lifecycle of a connection:

  1. DNS Request Interception: The application in the payment-processor-test pod executes a DNS lookup for api.stripe.com. This sends a UDP packet destined for kube-dns on port 53.
  2. eBPF Redirection: A Cilium-managed eBPF program attached to the pod's network interface (e.g., at the TC layer) inspects the outgoing packet. It identifies it as a DNS query and, instead of letting it proceed directly to kube-dns, redirects it to the Cilium agent's local DNS proxy.
  3. Proxy Forwarding & Response Caching: The Cilium agent receives the query and forwards it to the actual upstream DNS server (kube-dns). When kube-dns responds with the IP address(es) for api.stripe.com, the Cilium agent does two things:
     * It forwards the DNS response back to the application pod, so the application can proceed.
     * It populates its internal cache, creating a mapping: api.stripe.com -> [IP_A, IP_B, ...], respecting the TTL (Time To Live) from the DNS record. (The sketch after this list shows how to inspect this cache on a node.)
  4. eBPF Map Population: This is the crucial step. The Cilium agent updates a kernel-level eBPF map, often called cilium_ipcache. It inserts entries that associate the source pod's security identity with the allowed destination IPs. The entry effectively says: "Pods with identity X are allowed to connect to IP_A because it resolves from the allowed FQDN api.stripe.com."
  5. Connection Attempt & eBPF Enforcement: The application, having received the IP address, now initiates a TCP connection to IP_A on port 443. An eBPF program attached to the connect() syscall hook or the TC egress hook fires. This program performs a lookup against the cilium_ipcache map and checks: "Does an entry exist allowing the source identity of this pod to connect to the destination IP_A?"
     * If yes: The lookup succeeds, and the packet is allowed to proceed out of the pod's network namespace.
     * If no (e.g., the pod tries to connect to the IP of api.github.com): The lookup fails, and the eBPF program returns a drop verdict. The packet is discarded in the kernel before it ever leaves the node.
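To observe steps 3 and 4 on a live node, the agent's CLI can print the DNS cache the proxy has built. A minimal sketch, again assuming the standard k8s-app=cilium label (the in-pod binary may be cilium-dbg in newer releases):

bash
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
  -o jsonpath='{.items[0].metadata.name}')

# Show the FQDN-to-IP mappings the DNS proxy has learned, including TTLs
kubectl -n kube-system exec "$CILIUM_POD" -- cilium fqdn cache list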

This entire process is transparent to the application. It also ensures that even if an attacker compromises the pod and tries to bypass DNS by connecting directly to a malicious IP, the connection will be blocked unless that IP is currently a valid resolution for an allowed FQDN.
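You can verify this from the test pod by dialing an IP address directly, skipping DNS entirely; 1.1.1.1 here is just an arbitrary destination that no allowed FQDN resolves to:

bash
# Direct-to-IP connection attempt, no DNS lookup involved
kubectl -n payments exec -it payment-processor-test -- \
  curl -v --connect-timeout 5 https://1.1.1.1
# Expect a timeout: the IP is not associated with any allowed FQDN for this identity.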

Advanced Patterns and Edge Case Management

Deploying FQDN policies in production requires consideration of several edge cases and performance nuances.

1. Handling DNS TTLs and Cache Staleness

A significant challenge is DNS record volatility, especially with low TTLs. If an IP for api.stripe.com changes and the Cilium agent's cache is stale, legitimate connections can be dropped.

  • Problem: A DNS record has a 60-second TTL. The Cilium agent caches the IP. At 59 seconds, the application initiates a connection, which is allowed. At 61 seconds, the upstream DNS changes the IP. The application's connection library (or the OS) still has the old IP cached and tries to use it. The eBPF program now drops the packet because the cilium_ipcache has been updated (or expired) based on the new TTL, and the old IP is no longer valid.
  • Solution: Cilium employs several strategies that can be configured in the cilium-config ConfigMap:
    * tofqdns-min-ttl: This setting enforces a minimum TTL for cached DNS responses. If a response comes back with a 5-second TTL, Cilium can be configured to cache it for a minimum of, say, 3600 seconds. This reduces DNS churn but increases the risk of using a stale IP.
    * tofqdns-enable-dns-compression: Reduces the size of DNS messages, which can be beneficial in high-throughput environments.
    * Proactive polling: Some Cilium versions can proactively re-resolve FQDNs referenced in policies before their cache entries expire, keeping the cache warm with the latest IPs and helping availability during IP rotations.

Configuration example:

bash
kubectl -n kube-system edit cm cilium-config
# ... add or modify these values
# tofqdns-min-ttl: "3600"
# tofqdns-dns-reject-response-code: refused
# tofqdns-enable-poller: "true" # Proactively poll DNS (only in Cilium versions that still ship the poller)
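Note that cilium-config changes generally only take effect once the agents restart and re-read the ConfigMap. Assuming the default DaemonSet name cilium:

bash
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium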

2. Wildcard Policies (`matchPattern`) vs. Security

While matchPattern: "*.amazonaws.com" is convenient, it carries security risks. If a pod is compromised, an attacker could use the allowed wildcard policy to exfiltrate data to a malicious S3 bucket or EC2 instance they control under the amazonaws.com domain.

  • Best Practice: Be as specific as possible. Prefer matchName over matchPattern. If a wildcard is necessary, make the pattern as restrictive as possible (e.g., prod-*.us-east-1.rds.amazonaws.com is better than *.amazonaws.com).
  • Layered Defense: Combine FQDN policies with L7 policies where possible. For example, you can allow egress to api.third-party.com and then add an L7 rule that only allows GET /v1/data and denies POST /v1/upload (see the sketch below).
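A sketch of what such a layered rule could look like, with api.third-party.com and the path as illustrative placeholders; it assumes the DNS egress rule from the earlier policy is still in place. Because Cilium's proxy can only apply HTTP rules to traffic it can read, this example uses plain HTTP on port 80; applying L7 rules to HTTPS would additionally require Cilium's TLS visibility features:

yaml
# Hypothetical layered policy: FQDN scoping plus an L7 HTTP allow-list
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-third-party-get-only"
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
  - toFQDNs:
    - matchName: "api.third-party.com"
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/v1/data"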

3. Performance and Resource Considerations

  • DNS Proxy Overhead: The redirection to the user-space DNS proxy adds a small amount of latency to DNS lookups compared to direct kernel handling. However, this is usually negligible as it only affects the initial lookup, not the data path of the established connection.
  • eBPF Map Sizing: The cilium_ipcache map consumes kernel memory. In an environment with thousands of pods and policies referencing thousands of FQDNs that resolve to many IPs, this map can grow large. Monitor the Cilium agent's memory usage and check Cilium's metrics for any signs of map pressure or dropped entries.
  • CPU Usage: The eBPF programs themselves are extremely efficient, but the Cilium agent's DNS proxy and policy management logic consume CPU. This is typically low but should be monitored on nodes with high connection churn (a quick way to pull the relevant agent metrics is sketched below).
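A quick way to eyeball those agent metrics without a full Prometheus setup is to grep the agent's own metrics listing. A sketch, assuming the standard k8s-app=cilium label; exact metric names vary between Cilium versions:

bash
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
  -o jsonpath='{.items[0].metadata.name}')

# Filter the agent's metrics for DNS proxy, FQDN, and proxy-related series
kubectl -n kube-system exec "$CILIUM_POD" -- cilium metrics list | grep -Ei 'fqdn|dns|proxy'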

4. Debugging and Observability with Hubble

When a connection fails, determining the cause is critical. Hubble, Cilium's observability platform, is indispensable for this.

Let's simulate a failure and debug it.

  1. Confirm that the policy contains no toFQDNs rule allowing api.github.com (it never did in the policy above).
  2. Attempt the connection from the test pod:

bash
kubectl -n payments exec -it payment-processor-test -- curl -v --connect-timeout 5 https://api.github.com

  3. Use the Hubble CLI to observe the traffic from this pod:

bash
# Make sure hubble port-forward is running in another terminal
# kubectl -n kube-system port-forward svc/hubble-relay 4245:80

hubble observe -n payments --pod payment-processor-test --to-fqdn api.github.com -f

You will see output similar to this:

text
TIMESTAMP            SOURCE                              DESTINATION                         TYPE          VERDICT   SUMMARY
Sep 26 15:30:10.123  payments/payment-processor-test     kube-dns (10.96.0.10)               DNS           FORWARDED Query api.github.com.
Sep 26 15:30:10.456  payments/payment-processor-test     api.github.com (140.82.121.4)       l3/l4-egress  DROPPED   Policy denied

This output is incredibly revealing:

  • The first line shows the DNS query for api.github.com was FORWARDED, confirming our DNS-specific egress rule is working correctly.
  • The second line shows the subsequent TCP packet to the resolved IP (140.82.121.4) was DROPPED at the l3/l4-egress layer with the reason Policy denied.

This immediately tells us the problem is not DNS resolution but L3/L4 network policy enforcement, pointing directly to a missing or incorrect rule in our CiliumNetworkPolicy.
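When you don't yet know which destination is affected, a broader sweep over dropped flows in the namespace is often the fastest starting point:

bash
# Show recent dropped flows for the namespace to spot the blocked destination and port
hubble observe -n payments --verdict DROPPED --last 20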

Conclusion: Building Resilient and Secure Egress

By leveraging CiliumNetworkPolicy with toFQDNs, we transform our egress security posture from a brittle, IP-based model to a resilient, identity-aware one. The tight integration of eBPF for kernel-level enforcement and a sophisticated user-space control plane for DNS-to-IP mapping allows for policies that are both highly secure and dynamically adaptive to the realities of modern cloud infrastructure.

While the implementation requires a nuanced understanding of DNS behavior, TTLs, and the underlying eBPF mechanics, the operational stability and security gains are substantial. For senior engineers responsible for the reliability and security of Kubernetes clusters, mastering these patterns is no longer a niche skill but a core competency for building production-grade, cloud-native systems.
