eBPF in Prod: L7-Aware Cilium Policies for gRPC Microservices
The Inadequacy of L4 Policies in a gRPC World
In modern microservice architectures, particularly those built on gRPC, standard Kubernetes NetworkPolicy resources fall critically short. They operate at L3/L4 (IP address and port), effectively creating a digital wall around a pod. While this prevents unauthorized pods from initiating a connection, it's a blunt instrument. Once a connection is established, the NetworkPolicy has no visibility into the L7 data being transmitted.
Consider a typical scenario: a payments-service exposes a gRPC server on port 50051 with multiple RPC methods:
- ProcessPayment(PaymentRequest) returns (PaymentResponse)
- GetPaymentStatus(StatusRequest) returns (StatusResponse)
- IssueRefund(RefundRequest) returns (RefundResponse)
- GenerateFinancialReport(ReportRequest) returns (ReportResponse) (highly privileged)

An L4 NetworkPolicy (sketched below) can allow an order-service to connect to the payments-service on port 50051. However, it cannot prevent that order-service from calling the sensitive GenerateFinancialReport method. This is a significant security gap. The traditional solution involves implementing authorization logic within the application code or using a service mesh sidecar proxy like Envoy or Linkerd, which intercepts the connection, terminates TLS, inspects the request, and then makes a policy decision in userspace. This approach, while functional, introduces non-trivial latency, resource overhead, and operational complexity for every single pod.
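To make the gap concrete, here is a minimal sketch of such an L4-only policy; the order-service label and the default namespace are assumptions for this scenario. Note that there is simply no field in which to express "only ProcessPayment":

# l4-only-policy.yaml (illustrative sketch; labels and namespace are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-allow-order-service
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: payments-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-service
      ports:
        - protocol: TCP
          port: 50051
          # Nothing here can reference a gRPC method; once this rule matches,
          # every RPC exposed on port 50051 is reachable.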
This is where Cilium's eBPF-powered datapath offers a paradigm shift. By attaching eBPF programs to kernel hooks (like Traffic Control), Cilium can parse L7 protocols like gRPC directly in the kernel for plaintext traffic. This allows for policy enforcement at near-native kernel speed, drastically reducing the overhead associated with userspace proxies for this specific task.
This article will demonstrate how to implement and verify these advanced L7 gRPC policies in a production-like scenario.
Scenario Setup: Multi-Service gRPC Application
Let's define our Kubernetes environment. We'll have three services:
- api-gateway: The public-facing entry point. It can call payments-service and inventory-service.
- payments-service: Exposes the gRPC methods described earlier. Only api-gateway should call ProcessPayment and GetPaymentStatus. Only finance-batch-job can call GenerateFinancialReport.
- finance-batch-job: A privileged service that needs access to the reporting endpoint.

gRPC Server Implementation (`payments-service`)
Here's a simplified Go implementation for our payments-service. The key is to have distinct, identifiable methods.
// main.go for payments-service
package main
import (
"context"
"fmt"
"log"
"net"
pb "path/to/your/proto/payments"
"google.golang.org/grpc"
)
const port = ":50051"
type server struct{
pb.UnimplementedPaymentsServer
}
func (s *server) ProcessPayment(ctx context.Context, in *pb.PaymentRequest) (*pb.PaymentResponse, error) {
log.Printf("Received ProcessPayment for amount: %v", in.GetAmount())
return &pb.PaymentResponse{TransactionId: "tx_12345", Status: "SUCCESS"}, nil
}
func (s *server) GetPaymentStatus(ctx context.Context, in *pb.StatusRequest) (*pb.StatusResponse, error) {
log.Printf("Received GetPaymentStatus for ID: %v", in.GetTransactionId())
return &pb.StatusResponse{Status: "CONFIRMED"}, nil
}
func (s *server) IssueRefund(ctx context.Context, in *pb.RefundRequest) (*pb.RefundResponse, error) {
log.Printf("Received IssueRefund for ID: %v", in.GetTransactionId())
return &pb.RefundResponse{Status: "REFUND_PROCESSED"}, nil
}
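// GenerateFinancialReport is the privileged reporting RPC; the L7 policy defined later restricts it to finance-batch-job.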
func (s *server) GenerateFinancialReport(ctx context.Context, in *pb.ReportRequest) (*pb.ReportResponse, error) {
log.Printf("Received GenerateFinancialReport for period: %v", in.GetPeriod())
return &pb.ReportResponse{ReportUrl: "s3://bucket/report.csv"}, nil
}
func main() {
lis, err := net.Listen("tcp", port)
if err != nil {
log.Fatalf("failed to listen: %v", err)
}
s := grpc.NewServer()
pb.RegisterPaymentsServer(s, &server{})
log.Printf("server listening at %v", lis.Addr())
if err := s.Serve(lis); err != nil {
log.Fatalf("failed to serve: %v", err)
}
}
Kubernetes Deployments
We'll create deployments for our services, ensuring they have appropriate labels for policy selection.
# payments-service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: payments-service
labels:
app: payments-service
spec:
replicas: 1
selector:
matchLabels:
app: payments-service
template:
metadata:
labels:
app: payments-service
role: backend
spec:
containers:
- name: payments-service
image: your-repo/payments-service:v1
ports:
- containerPort: 50051
---
apiVersion: v1
kind: Service
metadata:
name: payments-service
spec:
selector:
app: payments-service
ports:
- protocol: TCP
port: 50051
targetPort: 50051
Deployments for api-gateway (with label app: api-gateway) and finance-batch-job (with label app: finance-batch-job) would be similar.
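For reference, a minimal sketch of the api-gateway Deployment (the image name and container port are placeholders); the only detail the policies below depend on is the app: api-gateway label. The finance-batch-job Deployment follows the same pattern with its own label.

# api-gateway-deployment.yaml (minimal sketch; image and port are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  labels:
    app: api-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: api-gateway
          image: your-repo/api-gateway:v1 # placeholder image
          ports:
            - containerPort: 8080 # assumed listen port for the gateway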
Crafting the L7 gRPC CiliumNetworkPolicy
With our services deployed, we can now define the CiliumNetworkPolicy (CNP). This CRD extends the standard NetworkPolicy with powerful L7 capabilities.
Our goal is to enforce the following rules on ingress traffic to payments-service:
- api-gateway Access: Pods with app: api-gateway can call ProcessPayment and GetPaymentStatus.
- finance-batch-job Access: Pods with app: finance-batch-job can only call GenerateFinancialReport.

Here is the complete CNP to achieve this:
# payments-service-l7-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: payments-service-grpc-policy
namespace: default
spec:
endpointSelector:
matchLabels:
app: payments-service
ingress:
- fromEndpoints:
- matchLabels:
app: api-gateway
toPorts:
- ports:
- port: "50051"
protocol: TCP
rules:
l7proto: "grpc"
l7:
- method: "/payments.Payments/ProcessPayment"
- method: "/payments.Payments/GetPaymentStatus"
- fromEndpoints:
- matchLabels:
app: finance-batch-job
toPorts:
- ports:
- port: "50051"
protocol: TCP
rules:
l7proto: "grpc"
l7:
- method: "/payments.Payments/GenerateFinancialReport"
Dissecting the Advanced CNP Components
- endpointSelector: This is standard, targeting the pods we want to protect (app: payments-service).
- ingress: We define a list of ingress rules. Cilium policies are whitelist-based, so any traffic not matching a rule will be dropped.
- fromEndpoints: This selects the source pods allowed by the rule. We have two separate rule blocks for our two different clients.
- toPorts: This section contains the L7 magic.
  - ports: We specify the L4 port (50051).
  - rules.l7proto: "grpc": This is the critical directive. It tells Cilium's eBPF datapath to engage the gRPC parser for traffic on this port. Without this, Cilium would treat it as opaque TCP.
  - rules.l7: This is an array of L7 rules. For gRPC, the key is method. The format is crucial: /package.Service/MethodName. You can find this exact string in your generated protobuf code.
By creating two separate ingress blocks, we achieve a clean separation of concerns. The api-gateway has its permissions, and the finance-batch-job has its own, more restrictive set. Neither can access the other's allowed methods, and crucially, api-gateway cannot call the IssueRefund method at all, as it's not whitelisted.
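If a dedicated refund workflow is introduced later, the same pattern extends by appending one more ingress block scoped to that client. A sketch, assuming a hypothetical app: refund-worker label:

# Additional ingress block for the CNP above (sketch; app: refund-worker is hypothetical)
- fromEndpoints:
    - matchLabels:
        app: refund-worker
  toPorts:
    - ports:
        - port: "50051"
          protocol: TCP
      rules:
        l7proto: "grpc"
        l7:
          - method: "/payments.Payments/IssueRefund"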
Verification and Observability with Hubble
Defining a policy is one thing; verifying it works and debugging it when it doesn't is another. This is where Hubble, Cilium's observability tool, becomes indispensable.
First, let's test the policy. From a shell inside the api-gateway pod, we'll use a gRPC client tool like grpcurl.
# From within the api-gateway pod
# This call should SUCCEED
$ grpcurl -plaintext payments-service:50051 payments.Payments/ProcessPayment
# ... gRPC success response ...
# This call should FAIL (hang and timeout)
$ grpcurl -plaintext payments-service:50051 payments.Payments/GenerateFinancialReport
# ... rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp <ip>:50051: operation was canceled" ...
The failing call doesn't get a clean 'permission denied' at the gRPC level because the packets are dropped by eBPF in the kernel before they even reach the payments-service application's TCP stack. The client sees this as a connection timeout or transport-level failure.
Now, let's see exactly why it failed. A first, low-level option is cilium monitor, run from the Cilium agent pod on the node hosting payments-service:
# From the Cilium agent pod on the relevant node
# Watch for drop verdicts related to our service
cilium monitor --type drop -v
You would see log entries indicating packet drops from the api-gateway pod's IP to the payments-service pod's IP on port 50051.
For a more intuitive view, we use the hubble CLI:
# Observe traffic flows from api-gateway to payments-service
hubble observe --from-label app=api-gateway --to-label app=payments-service --follow
When you make the calls, you'll see real-time verdicts:
Successful Call:
TIMESTAMP SOURCE DESTINATION TYPE VERDICT SUMMARY
... api-gateway-xyz -> payments-service-abc L7_REQUEST FORWARDED grpc: path:/payments.Payments/ProcessPayment
... payments-service-abc -> api-gateway-xyz L7_RESPONSE FORWARDED grpc: status:0
Denied Call:
TIMESTAMP SOURCE DESTINATION TYPE VERDICT SUMMARY
... api-gateway-xyz -> payments-service-abc L7_REQUEST DROPPED Policy denied
Hubble's output is unambiguous. It shows the L7_REQUEST was parsed, identified as a gRPC call to a specific method, and then DROPPED with the reason Policy denied. This level of kernel-level observability is a superpower for debugging complex network policies.
Edge Case: Handling TLS-Encrypted gRPC
Our example used plaintext gRPC for simplicity. In production, you'll almost certainly use TLS. This presents a challenge: eBPF operates in the kernel and cannot, by itself, access the keys needed to decrypt TLS traffic. It sees only an encrypted stream of bytes.
Cilium solves this through a hybrid approach, integrating with userspace proxies like its own embedded Envoy.
To enable this, you add TLS context to the relevant port rule in your CiliumNetworkPolicy. This tells Cilium to redirect matching traffic to the embedded proxy, which terminates TLS and applies the L7 rules.
# ... inside the toPorts section of the CNP ...
toPorts:
- ports:
- port: "50051"
protocol: TCP
terminatingTLS:
secret:
name: payments-tls-server-secret # K8s secret with tls.crt and tls.key
originatingTLS:
secret:
name: payments-tls-client-secret # For proxy-to-app TLS
rules:
l7proto: "grpc"
l7:
- method: "/payments.Payments/ProcessPayment"
This configuration instructs Cilium to provision the Envoy proxy with the necessary TLS certificates to terminate the connection. The L7 rule definition remains identical.
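The referenced secrets are ordinary Kubernetes TLS secrets containing tls.crt and tls.key. A sketch of what payments-tls-server-secret could look like (certificate data elided); depending on how Cilium is configured, policy secrets may need to live in the namespace the agent is allowed to read them from:

# payments-tls-server-secret.yaml (sketch; certificate data elided)
apiVersion: v1
kind: Secret
metadata:
  name: payments-tls-server-secret
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded server certificate>
  tls.key: <base64-encoded private key>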
Performance Implications: Kernel vs. Userspace
The performance difference between these two models is significant. The plaintext path is handled entirely by eBPF in the kernel, while the TLS path detours through the embedded Envoy proxy, which adds:
- Two extra passes through the network stack.
- The computational cost of TLS decryption/encryption.
- The processing logic within Envoy.
While this is still highly optimized, the latency penalty can be an order of magnitude higher than the pure eBPF path. Therefore, a key architectural consideration is whether certain internal, high-trust, high-throughput gRPC services can operate without TLS to leverage the full performance benefits of kernel-level L7 enforcement. For any service crossing a trust boundary, the eBPF+Envoy model provides the necessary security with performance that is still superior to many traditional service mesh implementations due to eBPF's intelligent traffic redirection.
Conclusion: A New Frontier in Cloud-Native Security
Moving beyond L4 security is no longer optional for complex microservice deployments. By leveraging Cilium and eBPF, senior engineers can implement granular, method-level authorization for gRPC services directly at the kernel layer. This approach provides a security posture that was previously only achievable with resource-intensive userspace service meshes.
The key takeaways for production implementation are:
- Enforce method-level gRPC authorization with L7 rules in the CiliumNetworkPolicy CRD.
- Match methods by their exact /package.Service/MethodName path, taken from your generated protobuf code.
- Verify and debug enforcement with Hubble's flow-level verdicts rather than guessing from client-side errors.
- For TLS-encrypted gRPC, plan for the eBPF+Envoy hybrid model and budget for its additional latency.

By mastering these patterns, you can build Kubernetes platforms that are not only more secure but also more performant and observable, directly addressing the core challenges of network security at scale.