Granular K8s Network Policy with eBPF and Cilium's `toFQDNs`
The Brittle Reality of IP-Based Egress Policies
In any mature Kubernetes environment, controlling egress traffic is a foundational security requirement. The principle of least privilege dictates that a pod should only be able to communicate with the specific external services it requires. The standard Kubernetes NetworkPolicy resource provides a mechanism for this, but its reliance on IP addresses and CIDR blocks creates a significant operational burden in modern cloud architectures.
Consider a payment-processor microservice that needs to communicate with Stripe's API (api.stripe.com) and a managed PostgreSQL database on AWS RDS (prod-db.cxyz.us-east-1.rds.amazonaws.com). A naive implementation using NetworkPolicy would look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-processor-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-processor
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 13.248.140.0/24 # A current IP range for api.stripe.com
      ports:
        - protocol: TCP
          port: 443
    - to:
        - ipBlock:
            cidr: 52.95.243.0/24 # A current IP range for AWS RDS
      ports:
        - protocol: TCP
          port: 5432
This approach is fundamentally flawed for two reasons:
1. The IP addresses behind these services are ephemeral. Stripe and AWS publish large, frequently changing IP ranges behind their DNS names; load balancers are scaled out, regions are added, and ranges are reallocated without notice. The moment a published range changes, the policy silently breaks the payment-processor service.
2. Keeping the policy current is a continuous operational burden. Someone must track the providers' published ranges and roll updated CIDR blocks out across every cluster and environment, a manual, error-prone process that does not scale.
This brittleness forces a choice between security and reliability. Many teams opt for overly permissive egress rules (e.g., 0.0.0.0/0), effectively abandoning the principle of least privilege and widening their attack surface.
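In practice, that catch-all often looks something like the sketch below (the policy name is illustrative); it technically satisfies "we have a NetworkPolicy" while allowing the pod to reach any IP on the internet:

# anti-pattern-egress.yaml -- the overly permissive fallback, shown only for contrast
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-processor
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0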
The eBPF Advantage: Identity-Aware Policy Enforcement in the Kernel
This is where Cilium and eBPF provide a paradigm shift. Instead of relying on ephemeral network identifiers like IP addresses, Cilium assigns each workload a stable security identity derived from its Kubernetes labels (including namespace and, optionally, ServiceAccount metadata). Policies are then written against these identities rather than against addresses. For external services, Cilium extends this concept by treating a Fully Qualified Domain Name (FQDN) as a form of identity.
Cilium bypasses the cumbersome and inefficient iptables chains used by many CNI plugins. Instead, it attaches lightweight eBPF (extended Berkeley Packet Filter) programs directly to kernel hooks, such as the socket layer (connect(), sendmsg()) and the networking stack's Traffic Control (TC) ingress/egress points.
For DNS-aware policies, this means:
* High Performance: An eBPF map lookup to check if a destination IP is valid for a given FQDN is an O(1) or O(log n) operation, far more efficient than traversing a linear iptables chain.
* Dynamic Updates: eBPF maps can be updated atomically from user-space without reloading rules or interrupting existing connections. This allows Cilium's agent to dynamically update the allowed IP set for an FQDN as DNS records change.
* Protocol-Level Visibility: eBPF can inspect packet data, allowing Cilium to intercept DNS requests/responses directly to build its FQDN-to-IP mapping.
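If you want to see these attachments on a live node, the commands below are an inspection sketch: they assume bpftool is available on the node (or in a debug container) and that Cilium runs as the usual cilium DaemonSet in kube-system.

# On the node itself: list eBPF programs attached at TC/XDP hooks of network devices
sudo bpftool net show

# From the cluster: ask a Cilium agent for its verbose datapath status
kubectl -n kube-system exec ds/cilium -- cilium status --verbose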
Deep Dive: Implementing DNS-Aware Policies with `CiliumNetworkPolicy`
Let's solve our initial problem using a CiliumNetworkPolicy (CNP) custom resource. The key feature is the toFQDNs stanza within an egress rule.
Scenario Setup:
We have a payment-processor pod in the payments namespace. It needs to make HTTPS calls to api.stripe.com and connect to our RDS instance at prod-db.cxyz.us-east-1.rds.amazonaws.com.
First, we'll create a test pod to simulate our application:
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-processor-test
  namespace: payments
  labels:
    app: payment-processor
spec:
  containers:
    - name: test-container
      image: alpine/curl:latest
      command: ["sleep", "3600"]
Now, let's define the CiliumNetworkPolicy:
# fqdn-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-payment-processor-fqdn-egress"
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
    - toEndpoints:
        - matchLabels:
            'k8s:io.kubernetes.pod.namespace': kube-system
            'k8s:k8s-app': kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: "api.stripe.com"
        - matchPattern: "*.rds.amazonaws.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
            - port: "5432"
              protocol: TCP
Let's break this down:
* endpointSelector: This targets the policy to apply to any pod with the label app: payment-processor.
* DNS egress rule (toEndpoints): The first egress rule allows the pod to send DNS queries to the kube-dns service in the kube-system namespace, and the dns: matchPattern: "*" rule tells Cilium's DNS proxy to inspect those queries so it can learn the FQDN-to-IP mappings. Without this, all FQDN lookups would fail, and the policy would be ineffective.
* FQDN egress rule (toFQDNs): This is the core of our solution.
* matchName: "api.stripe.com": This is an exact match. Only traffic to the FQDN api.stripe.com is allowed.
* matchPattern: "*.rds.amazonaws.com": This is a wildcard match. It allows traffic to any subdomain of rds.amazonaws.com, which is useful for cloud provider services where hostnames might change but the parent domain remains constant.
* toPorts: This restricts the allowed communication to TCP port 443 (for Stripe) and TCP port 5432 (for RDS).
Apply these resources:
kubectl create namespace payments
kubectl apply -f test-pod.yaml
kubectl apply -f fqdn-policy.yaml
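Before testing, it is worth a quick sanity check that the policy was accepted and that the endpoint is now under egress enforcement. A sketch (cnp is the short name for ciliumnetworkpolicies; each agent only lists endpoints on its own node, so run the second command against the agent on the test pod's node):

# Confirm the policy object exists
kubectl -n payments get cnp allow-payment-processor-fqdn-egress

# Check the Cilium endpoint for the test pod and its policy enforcement status
kubectl -n kube-system exec ds/cilium -- cilium endpoint list | grep payment-processor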
Now, let's test the connectivity from our pod:
# Test allowed FQDNs
kubectl -n payments exec -it payment-processor-test -- curl -v https://api.stripe.com
# This should succeed with a TLS handshake.
# Test a disallowed FQDN
kubectl -n payments exec -it payment-processor-test -- curl -v --connect-timeout 5 https://api.github.com
# This will hang and result in a connection timeout.
The curl to api.github.com fails not at the DNS level (which we allowed), but at the TCP connect() level. The pod resolves the IP for api.github.com, but when it attempts to establish a connection, the eBPF program attached by Cilium drops the packet because that destination IP is not associated with an allowed FQDN for this pod's identity.
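You can also confirm what the agent has learned from the pod's lookups by dumping its FQDN cache. A sketch: run it against the Cilium agent on the same node as the test pod, and expect the column layout to vary between Cilium versions.

# Dump the agent's FQDN-to-IP cache, then filter for the Stripe entry
kubectl -n kube-system exec ds/cilium -- cilium fqdn cache list | grep stripe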
Under the Hood: The DNS Proxy and eBPF Map Lifecycle
To truly understand why this is so robust, we must examine the internal mechanics of how Cilium processes DNS-aware policies. The process involves a collaboration between an eBPF program in the kernel and a DNS proxy running in the Cilium agent's user-space.
Here is the step-by-step lifecycle of a connection:
1. Application DNS lookup: The application in the payment-processor-test pod executes a DNS lookup for api.stripe.com. This sends a UDP packet destined for kube-dns on port 53.
2. Interception and redirection: An eBPF program recognizes this as a DNS request covered by policy and, instead of letting it go straight to kube-dns, redirects it to the Cilium agent's local DNS proxy.
3. Proxying and caching: The DNS proxy enforces the policy's DNS rules and forwards the query to the real resolver (kube-dns). When kube-dns responds with the IP address(es) for api.stripe.com, the Cilium agent does two things:
* It forwards the DNS response back to the application pod, so the application can proceed.
* It populates its internal cache, creating a mapping: api.stripe.com -> [IP_A, IP_B, ...], respecting the TTL (Time To Live) from the DNS record.
4. eBPF map update: The agent then updates its eBPF maps, most notably cilium_ipcache. It inserts entries that associate the source pod's security identity with the allowed destination IPs. The entry effectively says: "Pods with identity X are allowed to connect to IP_A because it resolves from the allowed FQDN api.stripe.com."
5. Connection-time enforcement: The application now opens a TCP connection to IP_A on port 443. An eBPF program attached to the connect() syscall hook or the TC egress hook fires. This program performs a lookup in the cilium_ipcache map. It checks: "Does an entry exist allowing the source identity of this pod to connect to the destination IP_A?"
* If yes: The lookup succeeds, and the packet is allowed to proceed out of the pod's network namespace.
* If no (e.g., the pod tries to connect to the IP of api.github.com): The lookup fails, and the eBPF program returns a drop verdict. The packet is discarded in the kernel before it ever leaves the node.
This entire process is transparent to the application. It also ensures that even if an attacker compromises the pod and tries to bypass DNS by connecting directly to a malicious IP, the connection will be blocked unless that IP is currently a valid resolution for an allowed FQDN.
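The kernel-side state is visible too. The agent can dump the ipcache map that the eBPF programs consult on every connection attempt (again a sketch; entries and columns differ across Cilium versions):

# Dump the BPF ipcache: IP/CIDR -> security identity mappings used for policy decisions
kubectl -n kube-system exec ds/cilium -- cilium bpf ipcache list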
Advanced Patterns and Edge Case Management
Deploying FQDN policies in production requires consideration of several edge cases and performance nuances.
1. Handling DNS TTLs and Cache Staleness
A significant challenge is DNS record volatility, especially with low TTLs. If an IP for api.stripe.com changes and the Cilium agent's cache is stale, legitimate connections can be dropped.
* Problem: A DNS record has a 60-second TTL. The Cilium agent caches the IP. At 59 seconds, the application initiates a connection, which is allowed. At 61 seconds, the upstream DNS changes the IP. The application's connection library (or the OS) still has the old IP cached and tries to use it. The eBPF program now drops the packet because the cilium_ipcache has been updated (or expired) based on the new TTL, and the old IP is no longer valid.
* Solution: Cilium employs several strategies that can be configured in the cilium-config ConfigMap:
* tofqdns-min-ttl: This setting enforces a minimum TTL for cached DNS responses. If a response comes back with a 5-second TTL, Cilium can be configured to keep it in the cache for, say, 3600 seconds instead. This reduces policy churn, at the cost of keeping potentially stale IPs in the allowed set for longer.
* tofqdns-enable-dns-compression: Reduces the size of DNS messages, which can be beneficial in high-throughput environments.
* Proactive Prefetching: Cilium can prefetch DNS lookups for FQDNs in policies before they expire, ensuring the cache is always warm with the latest IPs. This is a key mechanism for ensuring high availability.
Configuration example:
kubectl -n kube-system edit cm cilium-config
# ... add or modify these values
# tofqdns-dns-reject-response-code: refused
# tofqdns-enable-poller: "true" # Proactively poll DNS
2. Wildcard Policies (`matchPattern`) vs. Security
While matchPattern: "*.amazonaws.com" is convenient, it carries security risks. If a pod is compromised, an attacker could use the allowed wildcard policy to exfiltrate data to a malicious S3 bucket or EC2 instance they control under the amazonaws.com domain.
Best practices:
* Be as specific as possible: Prefer matchName over matchPattern. If a wildcard is necessary, make the pattern as restrictive as possible (e.g., prod-*.us-east-1.rds.amazonaws.com is better than *.amazonaws.com).
* Layered Defense: Combine FQDN policies with L7 policies where possible. For example, you can allow egress to api.third-party.com and then add an L7 rule that only allows GET /v1/data and denies POST /v1/upload, as sketched below.
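A minimal sketch of that layered rule follows. The FQDN and paths come from the hypothetical example above, and the policy assumes the DNS egress rule from the earlier CiliumNetworkPolicy is still in place. Note that matching HTTP methods and paths requires Cilium's L7 proxy to see the HTTP traffic: it applies directly to cleartext HTTP (port 80 here) and requires Cilium's TLS interception feature if the API is only reachable over HTTPS.

# l7-fqdn-policy.yaml -- illustrative sketch only
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-third-party-get-only"
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
    - toFQDNs:
        - matchName: "api.third-party.com"
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/v1/data"

Because the http rules list only GET /v1/data, any other method or path (including POST /v1/upload) is rejected by the proxy rather than forwarded.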
3. Performance and Resource Considerations
* DNS Proxy Overhead: The redirection to the user-space DNS proxy adds a small amount of latency to DNS lookups compared to direct kernel handling. However, this is usually negligible as it only affects the initial lookup, not the data path of the established connection.
* eBPF Map Sizing: The cilium_ipcache map consumes kernel memory. In an environment with thousands of pods and policies referencing thousands of FQDNs that resolve to many IPs, this map can grow large. Monitor the Cilium agent's memory usage and check Cilium's metrics for any signs of map pressure or dropped entries.
* CPU Usage: The eBPF programs themselves are extremely efficient, but the Cilium agent's DNS proxy and policy management logic consume CPU. This is typically low but should be monitored on nodes with high connection churn.
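For ongoing monitoring, the agent exposes these signals as Prometheus metrics and via its CLI. A quick spot-check sketch (metric names vary somewhat between Cilium versions, so the grep pattern is intentionally loose):

# Spot-check DNS/FQDN and BPF map related metrics on an agent
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep -iE 'fqdn|dns|map'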
4. Debugging and Observability with Hubble
When a connection fails, determining the cause is critical. Hubble, Cilium's observability platform, is indispensable for this.
Let's simulate a failure and debug it.
1. Remove any toFQDNs rule for api.github.com from our policy (or ensure it was never there).
2. Attempt the connection from the test pod:
kubectl -n payments exec -it payment-processor-test -- curl -v --connect-timeout 5 https://api.github.com
3. Use the Hubble CLI to observe the traffic from this pod:
# Make sure hubble port-forward is running in another terminal
# kubectl -n kube-system port-forward svc/hubble-relay 4245:80
hubble observe -n payments --pod payment-processor-test --to-fqdn api.github.com -f
You will see output similar to this:
TIMESTAMP SOURCE DESTINATION TYPE VERDICT SUMMARY
Sep 26 15:30:10.123 payments/payment-processor-test kube-dns (10.96.0.10) DNS FORWARDED Query api.github.com.
Sep 26 15:30:10.456 payments/payment-processor-test api.github.com (140.82.121.4) l3/l4-egress DROPPED Policy denied
This output is incredibly revealing:
* The first line shows the DNS query for api.github.com was FORWARDED, confirming our DNS-specific egress rule is working correctly.
* The second line shows the subsequent TCP packet to the resolved IP (140.82.121.4) was DROPPED at the l3/l4-egress layer with the reason Policy denied.
This immediately tells us the problem is not DNS resolution but L3/L4 network policy enforcement, pointing directly to a missing or incorrect rule in our CiliumNetworkPolicy.
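When you do not yet know which destination is affected, it can be easier to start broad and filter only on dropped flows for the namespace (this assumes the same hubble-relay port-forward as above):

# Show all dropped flows involving pods in the payments namespace
hubble observe -n payments --verdict DROPPED -f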
Conclusion: Building Resilient and Secure Egress
By leveraging CiliumNetworkPolicy with toFQDNs, we transform our egress security posture from a brittle, IP-based model to a resilient, identity-aware one. The tight integration of eBPF for kernel-level enforcement and a sophisticated user-space control plane for DNS-to-IP mapping allows for policies that are both highly secure and dynamically adaptive to the realities of modern cloud infrastructure.
While the implementation requires a nuanced understanding of DNS behavior, TTLs, and the underlying eBPF mechanics, the operational stability and security gains are substantial. For senior engineers responsible for the reliability and security of Kubernetes clusters, mastering these patterns is no longer a niche skill but a core competency for building production-grade, cloud-native systems.