eBPF in Prod: L7-Aware Cilium Policies for gRPC Microservices

Goh Ling Yong

The Inadequacy of L4 Policies in a gRPC World

In modern microservice architectures, particularly those built on gRPC, standard Kubernetes NetworkPolicy resources fall critically short. They operate at L3/L4 (IP address and port), effectively creating a digital wall around a pod. While this prevents unauthorized pods from initiating a connection, it's a blunt instrument. Once a connection is established, the NetworkPolicy has no visibility into the L7 data being transmitted.

Consider a typical scenario: a payments-service exposes a gRPC server on port 50051 with multiple RPC methods:

  • ProcessPayment(PaymentRequest) returns (PaymentResponse)
  • GetPaymentStatus(StatusRequest) returns (StatusResponse)
  • IssueRefund(RefundRequest) returns (RefundResponse)
  • GenerateFinancialReport(ReportRequest) returns (ReportResponse) (highly privileged)

An L4 NetworkPolicy can allow an order-service to connect to the payments-service on port 50051, but it cannot prevent that order-service from calling the sensitive GenerateFinancialReport method. This is a significant security gap. The traditional solution is to implement authorization logic inside the application code, or to run a userspace service mesh proxy (Envoy, or Linkerd's sidecar), which intercepts the connection, terminates TLS, inspects the request, and then makes a policy decision. This approach works, but it introduces non-trivial latency, resource overhead, and operational complexity for every single pod.
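
To make the gap concrete, here is roughly what that L4-only policy looks like (a sketch; the order-service label and the default namespace are assumed for illustration). It admits the connection on port 50051 and can say nothing about which RPC methods flow over it:

yaml
# l4-only-policy.yaml (illustrative sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payments
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: payments-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: order-service
    ports:
    - protocol: TCP
      port: 50051

Once this policy admits the connection, every method behind port 50051, including GenerateFinancialReport, is reachable.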

    This is where Cilium's eBPF-powered datapath offers a paradigm shift. By attaching eBPF programs to kernel hooks (like Traffic Control), Cilium can parse L7 protocols like gRPC directly in the kernel for plaintext traffic. This allows for policy enforcement at near-native kernel speed, drastically reducing the overhead associated with userspace proxies for this specific task.

    This article will demonstrate how to implement and verify these advanced L7 gRPC policies in a production-like scenario.

    Scenario Setup: Multi-Service gRPC Application

    Let's define our Kubernetes environment. We'll have three services:

  • api-gateway: The public-facing entry point. It can call payments-service and inventory-service.
  • payments-service: Exposes the gRPC methods described earlier. Only api-gateway should call ProcessPayment and GetPaymentStatus. Only finance-batch-job can call GenerateFinancialReport.
  • finance-batch-job: A privileged service that needs access to the reporting endpoint.

    gRPC Server Implementation (`payments-service`)
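
    For reference, the gRPC surface these policies will target could be defined with a proto along these lines (a sketch: the package name payments and service name Payments match the method paths used later in the policy, field names mirror the accessors used in the Go server below, and the exact field types are assumptions):

    protobuf
    // payments.proto (illustrative sketch)
    syntax = "proto3";

    package payments;

    option go_package = "path/to/your/proto/payments";

    service Payments {
      rpc ProcessPayment(PaymentRequest) returns (PaymentResponse);
      rpc GetPaymentStatus(StatusRequest) returns (StatusResponse);
      rpc IssueRefund(RefundRequest) returns (RefundResponse);
      rpc GenerateFinancialReport(ReportRequest) returns (ReportResponse);
    }

    message PaymentRequest  { double amount = 1; }                            // type assumed
    message PaymentResponse { string transaction_id = 1; string status = 2; }
    message StatusRequest   { string transaction_id = 1; }
    message StatusResponse  { string status = 1; }
    message RefundRequest   { string transaction_id = 1; }
    message RefundResponse  { string status = 1; }
    message ReportRequest   { string period = 1; }                            // type assumed
    message ReportResponse  { string report_url = 1; }

    Each rpc above maps to a method path of the form /payments.Payments/<MethodName>, which is exactly the string the CiliumNetworkPolicy will match on.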

    Here's a simplified Go implementation for our payments-service. The key is to have distinct, identifiable methods.

    go
    // main.go for payments-service
    package main
    
     import (
     	"context"
     	"log"
     	"net"
     
     	"google.golang.org/grpc"
     	"google.golang.org/grpc/reflection"
     
     	pb "path/to/your/proto/payments"
     )
    
    const port = ":50051"
    
    type server struct{
    	pb.UnimplementedPaymentsServer
    }
    
    func (s *server) ProcessPayment(ctx context.Context, in *pb.PaymentRequest) (*pb.PaymentResponse, error) {
    	log.Printf("Received ProcessPayment for amount: %v", in.GetAmount())
    	return &pb.PaymentResponse{TransactionId: "tx_12345", Status: "SUCCESS"}, nil
    }
    
    func (s *server) GetPaymentStatus(ctx context.Context, in *pb.StatusRequest) (*pb.StatusResponse, error) {
    	log.Printf("Received GetPaymentStatus for ID: %v", in.GetTransactionId())
    	return &pb.StatusResponse{Status: "CONFIRMED"}, nil
    }
    
    func (s *server) IssueRefund(ctx context.Context, in *pb.RefundRequest) (*pb.RefundResponse, error) {
    	log.Printf("Received IssueRefund for ID: %v", in.GetTransactionId())
    	return &pb.RefundResponse{Status: "REFUND_PROCESSED"}, nil
    }
    
    func (s *server) GenerateFinancialReport(ctx context.Context, in *pb.ReportRequest) (*pb.ReportResponse, error) {
    	log.Printf("Received GenerateFinancialReport for period: %v", in.GetPeriod())
    	return &pb.ReportResponse{ReportUrl: "s3://bucket/report.csv"}, nil
    }
    
    func main() {
    	lis, err := net.Listen("tcp", port)
    	if err != nil {
    		log.Fatalf("failed to listen: %v", err)
    	}
     	s := grpc.NewServer()
     	pb.RegisterPaymentsServer(s, &server{})
     	// Register the reflection service so CLI tools like grpcurl can discover methods.
     	reflection.Register(s)
     	log.Printf("server listening at %v", lis.Addr())
    	if err := s.Serve(lis); err != nil {
    		log.Fatalf("failed to serve: %v", err)
    	}
    }

    Kubernetes Deployments

    We'll create deployments for our services, ensuring they have appropriate labels for policy selection.

    yaml
    # payments-service-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: payments-service
      labels:
        app: payments-service
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: payments-service
      template:
        metadata:
          labels:
            app: payments-service
            role: backend
        spec:
          containers:
          - name: payments-service
            image: your-repo/payments-service:v1
            ports:
            - containerPort: 50051
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: payments-service
    spec:
      selector:
        app: payments-service
      ports:
      - protocol: TCP
        port: 50051
        targetPort: 50051

    Deployments for api-gateway (with label app: api-gateway) and finance-batch-job (with label app: finance-batch-job) would be similar.
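
    A minimal sketch of the api-gateway deployment, for example, only needs the label the policy will select on (the image name is a placeholder, following the same convention as above):

    yaml
    # api-gateway-deployment.yaml (sketch)
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-gateway
      labels:
        app: api-gateway
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: api-gateway
      template:
        metadata:
          labels:
            app: api-gateway
        spec:
          containers:
          - name: api-gateway
            image: your-repo/api-gateway:v1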

    Crafting the L7 gRPC CiliumNetworkPolicy

    With our services deployed, we can now define the CiliumNetworkPolicy (CNP). This CRD extends the standard NetworkPolicy with powerful L7 capabilities.

    Our goal is to enforce the following rules on ingress traffic to payments-service:

  • Default Deny: No traffic is allowed unless explicitly permitted.
  • api-gateway Access: Pods with app: api-gateway can call ProcessPayment and GetPaymentStatus.
  • finance-batch-job Access: Pods with app: finance-batch-job can only call GenerateFinancialReport.

    Here is the complete CNP to achieve this:

    yaml
    # payments-service-l7-policy.yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: payments-service-grpc-policy
      namespace: default
    spec:
      endpointSelector:
        matchLabels:
          app: payments-service
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: api-gateway
        toPorts:
        - ports:
          - port: "50051"
            protocol: TCP
          rules:
            l7proto: "grpc"
            l7:
            - method: "/payments.Payments/ProcessPayment"
            - method: "/payments.Payments/GetPaymentStatus"
    
      - fromEndpoints:
        - matchLabels:
            app: finance-batch-job
        toPorts:
        - ports:
          - port: "50051"
            protocol: TCP
          rules:
            l7proto: "grpc"
            l7:
            - method: "/payments.Payments/GenerateFinancialReport"

    Dissecting the Advanced CNP Components

  • endpointSelector: This is standard, targeting the pods we want to protect (app: payments-service).
  • ingress: We define a list of ingress rules. Cilium policies are whitelist-based, so any traffic not matching a rule will be dropped.
  • fromEndpoints: This selects the source pods allowed by the rule. We have two separate rule blocks for our two different clients.
  • toPorts: This section contains the L7 magic.
    - ports: We specify the L4 port (50051).
    - rules.l7proto: "grpc": This is the critical directive. It tells Cilium's eBPF datapath to engage the gRPC parser for traffic on this port. Without this, Cilium would treat the traffic as opaque TCP.
    - rules.l7: This is an array of L7 rules. For gRPC, the key is method. The format is crucial: /package.Service/MethodName. You can find this exact string in your generated protobuf code (see the snippet below).
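
    If you are ever unsure of the exact string, one way to double-check is to print the *_FullMethodName constants that recent versions of protoc-gen-go-grpc (v1.3.0 or newer, assumed here) emit into the generated _grpc.pb.go file; they are precisely the paths the policy matches on:

    go
    // methodnames.go: hypothetical helper that prints the full gRPC method paths.
    package main

    import (
    	"fmt"

    	pb "path/to/your/proto/payments" // the same generated package the server imports
    )

    func main() {
    	// Constants generated by protoc-gen-go-grpc >= 1.3.0 (assumed).
    	fmt.Println(pb.Payments_ProcessPayment_FullMethodName)          // /payments.Payments/ProcessPayment
    	fmt.Println(pb.Payments_GetPaymentStatus_FullMethodName)        // /payments.Payments/GetPaymentStatus
    	fmt.Println(pb.Payments_IssueRefund_FullMethodName)             // /payments.Payments/IssueRefund
    	fmt.Println(pb.Payments_GenerateFinancialReport_FullMethodName) // /payments.Payments/GenerateFinancialReport
    }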

    By creating two separate ingress blocks, we achieve a clean separation of concerns. The api-gateway has its permissions, and the finance-batch-job has its own, more restrictive set. Neither can access the other's allowed methods, and crucially, api-gateway cannot call the IssueRefund method at all, as it's not whitelisted.

    Verification and Observability with Hubble

    Defining a policy is one thing; verifying it works and debugging it when it doesn't is another. This is where Hubble, Cilium's observability tool, becomes indispensable.
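
    Before testing, apply the policy and confirm the object exists (the filename matches the manifest above; the exact status output varies by Cilium version):

    bash
    # Apply the L7 policy and confirm the CiliumNetworkPolicy object was created
    kubectl apply -f payments-service-l7-policy.yaml
    kubectl get ciliumnetworkpolicies -n default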

    First, let's test the policy. From a shell inside the api-gateway pod, we'll use a gRPC client tool like grpcurl.

    bash
    # From within the api-gateway pod
    
    # This call should SUCCEED
    $ grpcurl -plaintext payments-service:50051 payments.Payments/ProcessPayment
    # ... gRPC success response ...
    
    # This call should FAIL (hang and timeout)
    $ grpcurl -plaintext payments-service:50051 payments.Payments/GenerateFinancialReport
    # ... rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp <ip>:50051: operation was canceled" ...

    The failing call doesn't get a clean 'permission denied' at the gRPC level because the offending request is dropped in the kernel by eBPF before it ever reaches the payments-service application. The client sees this as a connection timeout or transport-level failure.
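
    The mirror-image checks from inside the finance-batch-job pod are worth running as well; under the policy above, the expectations simply invert:

    bash
    # From within the finance-batch-job pod

    # This call should SUCCEED (the only method whitelisted for this client)
    $ grpcurl -plaintext payments-service:50051 payments.Payments/GenerateFinancialReport

    # This call should FAIL, since ProcessPayment is not whitelisted for finance-batch-job
    $ grpcurl -plaintext payments-service:50051 payments.Payments/ProcessPayment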

    Now, let's see exactly why the denied calls fail. The quickest low-level check is cilium monitor, run from the Cilium agent pod on the node hosting payments-service:

    bash
    # From inside the cilium-agent pod on the node running payments-service
    # Stream datapath drop events
    cilium monitor --type drop -v

    You would see log entries indicating packet drops from the api-gateway pod's IP to the payments-service pod's IP on port 50051.

    For a more intuitive view, we use the hubble CLI:

    bash
    # Observe traffic flows from api-gateway to payments-service
    hubble observe --from-label app=api-gateway --to-label app=payments-service --follow

    When you make the calls, you'll see real-time verdicts:

    Successful Call:

    text
    TIMESTAMP     SOURCE              DESTINATION          TYPE          VERDICT     SUMMARY
    ...           api-gateway-xyz -> payments-service-abc L7_REQUEST    FORWARDED   grpc: path:/payments.Payments/ProcessPayment
    ...           payments-service-abc -> api-gateway-xyz L7_RESPONSE   FORWARDED   grpc: status:0

    Denied Call:

    text
    TIMESTAMP     SOURCE              DESTINATION          TYPE          VERDICT     SUMMARY
    ...           api-gateway-xyz -> payments-service-abc L7_REQUEST    DROPPED     Policy denied

    Hubble's output is unambiguous. It shows the L7_REQUEST was parsed, identified as a gRPC call to a specific method, and then DROPPED with the reason Policy denied. This level of kernel-level observability is a superpower for debugging complex network policies.
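
    When you only care about denials, filtering on the verdict keeps the noise down:

    bash
    # Show only dropped flows heading to payments-service
    hubble observe --verdict DROPPED --to-label app=payments-service --follow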

    Edge Case: Handling TLS-Encrypted gRPC

    Our example used plaintext gRPC for simplicity. In production, you'll almost certainly use TLS. This presents a challenge: eBPF operates in the kernel and cannot, by itself, access the keys needed to decrypt TLS traffic. It sees only an encrypted stream of bytes.

    Cilium solves this through a hybrid approach, integrating with userspace proxies like its own embedded Envoy.

  • Traffic Redirection: The eBPF program at the TC hook is still the first point of contact. It can perform L3/L4 filtering based on the CNP.
  • Proxy Redirection: If the policy includes L7 rules for a TLS-enabled port, Cilium's eBPF program transparently redirects the matching traffic to the Envoy proxy that Cilium manages on the node. This is more efficient than a full sidecar model because only the traffic requiring L7 inspection passes through the proxy.
  • TLS Termination & Policy Enforcement: Envoy terminates the TLS, inspects the decrypted gRPC call, and enforces the L7 policy rules defined in the CNP.
  • Forwarding: If the call is allowed, Envoy forwards it to the application container over the local loopback interface.

    To enable this, you extend the port rule in your CiliumNetworkPolicy with a TLS context alongside the L7 parser, so Cilium knows to redirect the traffic to the proxy and how to terminate it:

    yaml
    # ... inside the toPorts section of the CNP ...
        toPorts:
        - ports:
          - port: "50051"
            protocol: TCP
          terminatingTLS:
            secret:
              name: payments-tls-server-secret # K8s secret with tls.crt and tls.key
          originatingTLS:
            secret:
              name: payments-tls-client-secret # For proxy-to-app TLS
          rules:
            l7proto: "grpc"
            l7:
            - method: "/payments.Payments/ProcessPayment"

    This configuration instructs Cilium to provision the Envoy proxy with the necessary TLS certificates to terminate the connection. The L7 rule definition remains identical.
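
    The referenced secrets are ordinary Kubernetes TLS secrets. Something along these lines creates the server-side one (certificate file names are placeholders; the namespace Cilium expects to read secrets from depends on how it was installed):

    bash
    # Create the certificate/key pair referenced by terminatingTLS
    kubectl create secret tls payments-tls-server-secret \
      --cert=payments-server.crt --key=payments-server.key \
      --namespace default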

    Performance Implications: Kernel vs. Userspace

    The performance difference between these two models is significant.

  • Plaintext (Pure eBPF): Policy enforcement happens entirely in the kernel. The overhead is measured in microseconds. Packets are parsed and dropped/forwarded without ever context-switching to userspace or traversing the full TCP/IP stack twice. This is the ideal state for performance-critical services.
  • TLS (eBPF + Envoy): The path involves kernel -> userspace (Envoy) -> kernel -> userspace (application). This introduces latency from:
    - Two extra passes through the network stack.
    - The computational cost of TLS decryption/encryption.
    - The processing logic within Envoy.

    While this is still highly optimized, the latency penalty can be an order of magnitude higher than the pure eBPF path. Therefore, a key architectural consideration is whether certain internal, high-trust, high-throughput gRPC services can operate without TLS to leverage the full performance benefits of kernel-level L7 enforcement. For any service crossing a trust boundary, the eBPF+Envoy model provides the necessary security with performance that is still superior to many traditional service mesh implementations due to eBPF's intelligent traffic redirection.

    Conclusion: A New Frontier in Cloud-Native Security

    Moving beyond L4 security is no longer optional for complex microservice deployments. By leveraging Cilium and eBPF, senior engineers can implement granular, method-level authorization for gRPC services directly at the kernel layer. This approach provides a security posture that was previously only achievable with resource-intensive userspace service meshes.

    The key takeaways for production implementation are:

  • L7 is Declarative: Define gRPC method permissions in the same way you define L3/L4 rules, using the CiliumNetworkPolicy CRD.
  • Performance is Paramount: For internal, plaintext gRPC, the performance gains of in-kernel enforcement are substantial. Architect your services to take advantage of this where possible.
  • TLS is a Solved Problem: For encrypted traffic, Cilium's hybrid eBPF-Envoy model provides a robust and efficient solution for L7 inspection without the full overhead of a traditional sidecar-for-everything architecture.
  • Observability is Non-Negotiable: Use tools like Hubble to visualize, verify, and debug policy decisions in real-time. The ability to see L7 metadata on dropped packets is invaluable for troubleshooting in a zero-trust environment.

    By mastering these patterns, you can build Kubernetes platforms that are not only more secure but also more performant and observable, directly addressing the core challenges of network security at scale.
