SPIFFE/SPIRE for Zero-Trust mTLS in Kubernetes Microservices

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Problem with Implicit Trust in Kubernetes

In a standard Kubernetes environment, network trust is often naively based on pod IP addresses. While NetworkPolicies provide L3/L4 segmentation, they don't cryptographically verify the identity of the workload itself. A compromised pod with a legitimate IP address can still communicate with services it shouldn't, or a sophisticated attacker could spoof IP packets. Traditional service meshes like Istio or Linkerd address this with mTLS, but they introduce significant operational complexity, resource overhead via sidecar injection, and often tightly couple identity management to the mesh's lifecycle.

For engineering teams focused purely on establishing a strong, portable, and automated identity foundation, a full service mesh can be overkill. The core problem isn't traffic management; it's the lack of a universal, cryptographically verifiable workload identity. How can Service A prove it is Service A to Service B, regardless of its IP address, the node it's running on, or even the cluster it resides in?

This is the precise problem SPIFFE (Secure Production Identity Framework for Everyone) and its production-ready implementation, SPIRE (the SPIFFE Runtime Environment), are designed to solve. SPIFFE provides a specification for a universal identity framework, while SPIRE is the machinery that issues these identities to workloads in a secure and automated fashion. This article will dissect the production implementation of SPIRE in Kubernetes to achieve zero-trust mTLS without the full overhead of a traditional service mesh.

Core Concepts: A Refresher for Architects

Before diving into YAML and Go code, let's align on the core SPIFFE/SPIRE primitives. We assume familiarity with the basics; our focus is on their interplay within a Kubernetes context.

* **SPIFFE ID**: A URI-formatted string that represents a unique workload identity, e.g., spiffe://your-trust.domain/ns/production/sa/api-server. This is the *name* of your service.

* **SVID (SPIFFE Verifiable Identity Document)**: The cryptographic document that proves a workload's SPIFFE ID. The most common format is an X.509 certificate, where the SPIFFE ID is embedded in the SAN (Subject Alternative Name) field. This is the workload's *passport*.

* **Trust Bundle**: A collection of public keys (as a CA certificate bundle) from a trust domain. A workload uses this bundle to verify the SVIDs presented by other workloads. This is the list of trusted *passport issuers*.

* **Workload API**: A local UNIX Domain Socket (UDS) that the SPIRE Agent exposes on each node. Workloads connect to this socket to securely request their SVIDs and the current trust bundle without needing any pre-configured secrets. This is the *passport office* for the workload.

* **Attestation**: The process by which SPIRE verifies a workload's identity before issuing an SVID. In Kubernetes, this typically involves the SPIRE Agent inspecting properties of the pod, such as its namespace, service account, or labels, and presenting this evidence to the SPIRE Server.

This system decouples workload identity from the underlying network infrastructure, providing a powerful foundation for zero-trust security.
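To make the SVID concept concrete, here is a minimal, self-contained Go sketch (standard library only, no SPIRE involved) that builds a toy X.509-SVID and reads the SPIFFE ID back out of the URI SAN field, which is exactly where a peer looks during an mTLS handshake. The `newToySVID` helper is invented for illustration; a real SVID is signed by the SPIRE Server's CA, not self-signed.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

// newToySVID self-signs a certificate whose URI SAN carries the SPIFFE ID.
func newToySVID(spiffeID string) (*x509.Certificate, error) {
	id, err := url.Parse(spiffeID)
	if err != nil {
		return nil, err
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{Organization: []string{"My Company"}},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour), // SVIDs are short-lived
		URIs:         []*url.URL{id},            // the SPIFFE ID lives here
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newToySVID("spiffe://my-company.com/ns/production/sa/api-server")
	if err != nil {
		panic(err)
	}
	// A verifier recovers the workload identity from the URI SAN, not the Subject.
	fmt.Println(cert.URIs[0].String())
}
```

Note that the identity is carried entirely in the SAN extension; the certificate Subject is informational.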

Architecting SPIRE in a Production Kubernetes Cluster

A production-grade SPIRE deployment in Kubernetes consists of two main components:

  • SPIRE Server: A centralized component, typically run as a StatefulSet, that manages the trust domain, registers workloads, and signs SVIDs. It requires persistent storage to maintain registration entries and its own signing keys.
  • SPIRE Agent: A DaemonSet that runs on every node in the cluster. It attests the identity of the node itself to the server, then exposes the Workload API to pods on that node, and performs workload attestation on their behalf.
Here is a high-level architectural diagram:

```mermaid
graph TD
    subgraph "Kubernetes Cluster"
        subgraph "Node 1"
            A[Pod A - api-server] -- UDS --> B(SPIRE Agent)
            C[Pod C - database-client] -- UDS --> B
        end
        subgraph "Node 2"
            D[Pod D - reporting-svc] -- UDS --> E(SPIRE Agent)
        end
        subgraph "Control Plane Namespace"
            F[SPIRE Server - StatefulSet] <--> G{Persistent Volume}
        end
        B -- gRPC --> F
        E -- gRPC --> F
        A -- mTLS --> D
    end
```

    The Attestation Flow in Detail

1. **Node Attestation**: When a SPIRE Agent starts on a new node, it must first prove its own identity to the SPIRE Server. In Kubernetes, the k8s_psat (Kubernetes Projected Service Account Token) attestor is a common choice. The agent generates a token for its own service account and presents it to the server, which validates it against the Kubernetes API server. Upon success, the server issues an SVID for the node itself.
2. **Workload Attestation**: When a pod (e.g., api-server) connects to the Workload API on its node, the SPIRE Agent needs to identify it. The agent uses the k8s workload attestor, which inspects the pod's properties via the kubelet API: namespace, service account name, labels, and so on. The agent sends these properties (called "selectors") to the SPIRE Server.
3. **Registration Matching**: The SPIRE Server checks its list of pre-configured registration entries for one that matches the selectors presented for that workload. For example, an entry might say: "Any workload running with the service account api-server in the production namespace is entitled to the SPIFFE ID spiffe://your-trust.domain/ns/production/sa/api-server."
4. **SVID Issuance**: If a match is found, the SPIRE Server generates an SVID for the workload, signs it, and sends it back to the agent. The agent delivers it to the pod over the Workload API socket.

This entire process is automated, continuous, and requires no manual secret distribution to workloads.
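The registration-matching step boils down to a selector comparison. The following Go sketch illustrates the idea only; `Entry` and `match` are invented names, not SPIRE's actual implementation. An entry matches when every one of its selectors is satisfied by the workload's attested properties:

```go
package main

import "fmt"

// Entry is a simplified registration entry: a SPIFFE ID guarded by selectors.
type Entry struct {
	SPIFFEID  string
	Selectors map[string]string // e.g. "k8s:ns" -> "production"
}

// match returns the SPIFFE ID of the first entry whose selectors are all
// present, with matching values, in the workload's attested properties.
func match(entries []Entry, workload map[string]string) (string, bool) {
	for _, e := range entries {
		ok := true
		for k, v := range e.Selectors {
			if workload[k] != v {
				ok = false
				break
			}
		}
		if ok {
			return e.SPIFFEID, true
		}
	}
	return "", false
}

func main() {
	entries := []Entry{{
		SPIFFEID: "spiffe://my-company.com/ns/production/sa/api-server",
		Selectors: map[string]string{
			"k8s:ns": "production",
			"k8s:sa": "api-server",
		},
	}}
	// Properties discovered by the agent's k8s workload attestor.
	attested := map[string]string{"k8s:ns": "production", "k8s:sa": "api-server"}
	if id, ok := match(entries, attested); ok {
		fmt.Println("issue SVID for", id)
	}
}
```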

    Implementation Walkthrough: A Production-Grade Deployment

    Let's deploy SPIRE and two Go microservices that will use it to establish an mTLS connection.

    Prerequisites:

    * A Kubernetes cluster (e.g., kind, minikube, or a cloud provider)

    * kubectl configured

    * helm v3+

    Step 1: Deploying SPIRE with Helm

    We will use the official SPIFFE Helm chart. First, add the repository:

```bash
helm repo add spiffe https://spiffe.github.io/helm-charts/
helm repo update
```

    Now, create a spire-values.yaml file. This configuration is critical for a production setup.

```yaml
# spire-values.yaml

spire-server:
  # The trust domain for your organization
  trust_domain: "my-company.com"

  # Enable the spire-controller-manager so registration entries
  # can be managed via Kubernetes CRDs
  controller_manager:
    enabled: true

  config:
    server:
      ca_subject:
        organization: ["My Company"]
        country: ["US"]
      plugins:
        # Use the k8s_psat node attestor for secure node attestation
        NodeAttestor:
          k8s_psat:
            clusters:
              # Use the cluster name you desire.
              # This must match the agent configuration.
              my-cluster:
                service_account_allow_list: ["spire:spire-agent"]

        KeyManager:
          # In production, use a cloud KMS (e.g., 'aws_kms', 'gcp_kms')
          # For this demo, 'memory' is sufficient.
          memory: {}

spire-agent:
  # Must match the cluster name in the server config
  cluster_name: "my-cluster"

  config:
    agent:
      plugins:
        NodeAttestor:
          k8s_psat: {}
        KeyManager:
          memory: {}
        WorkloadAttestor:
          k8s: {}
```

    Deploy SPIRE to its own namespace:

```bash
kubectl create namespace spire
helm install spire spiffe/spire --namespace spire --values spire-values.yaml
```

    This will deploy the SPIRE Server StatefulSet, the SPIRE Agent DaemonSet, and the spire-controller-manager which allows us to manage registration entries via Kubernetes CRDs.

    Step 2: Dynamic Workload Registration via `ClusterSPIFFEID`

    Manually creating registration entries via spire-server entry create is not scalable or GitOps-friendly. The modern approach is to use the ClusterSPIFFEID Custom Resource.

    This CRD tells the spire-controller-manager to watch for pods matching certain criteria and automatically create the corresponding registration entries in the SPIRE Server.

    Let's define identities for two services: echo-server and echo-client.

```yaml
# registration.yaml
apiVersion: spiffeid.spiffe.io/v1beta1
kind: ClusterSPIFFEID
metadata:
  name: echo-server
spec:
  # The SPIFFE ID is derived from the pod's namespace and service account
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodMeta.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: echo-server
---
apiVersion: spiffeid.spiffe.io/v1beta1
kind: ClusterSPIFFEID
metadata:
  name: echo-client
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodMeta.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: echo-client
```
    Apply this manifest:

```bash
kubectl apply -f registration.yaml
```

    Now, any pod with the label app: echo-server will be automatically granted a SPIFFE ID based on its namespace and service account. The same applies to echo-client.
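The spiffeIDTemplate above is standard Go template syntax over fields the controller manager exposes. As a rough illustration (using minimal stand-in types for the real template context, which has more fields), here is how the ID for our echo-server pod would render:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Minimal stand-ins for the template context the controller manager provides.
type PodMeta struct{ Namespace, ServiceAccountName string }
type templateCtx struct {
	TrustDomain string
	PodMeta     PodMeta
}

// renderID expands a spiffeIDTemplate against a pod's metadata.
func renderID(tmpl string, ctx templateCtx) (string, error) {
	t, err := template.New("spiffeid").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ctx); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	id, err := renderID(
		"spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodMeta.ServiceAccountName }}",
		templateCtx{TrustDomain: "my-company.com", PodMeta: PodMeta{"my-app", "echo-server-sa"}},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // spiffe://my-company.com/ns/my-app/sa/echo-server-sa
}
```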

    Step 3: Instrumenting Go Microservices

    This is where the magic happens. The application code must be aware of the SPIFFE Workload API to fetch its identity documents.

    First, let's create the Kubernetes manifests for our services. Note that we mount the SPIRE agent socket directory into the pod.

```yaml
# microservices.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: echo-server-sa
  namespace: my-app
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: echo-client-sa
  namespace: my-app
---
apiVersion: v1
kind: Service
metadata:
  name: echo-server
  namespace: my-app
spec:
  ports:
  - port: 8443
    targetPort: 8443
  selector:
    app: echo-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-server
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      serviceAccountName: echo-server-sa
      containers:
      - name: echo-server
        # Replace with your actual image
        image: your-repo/echo-server:latest
        ports:
        - containerPort: 8443
        env:
        # go-spiffe discovers the Workload API through this variable.
        # The socket name must match your SPIRE Agent's configuration.
        - name: SPIFFE_ENDPOINT_SOCKET
          value: "unix:///spire/agent.sock"
        volumeMounts:
        - name: spire-agent-socket
          mountPath: /spire
          readOnly: true
      volumes:
      - name: spire-agent-socket
        hostPath:
          # Must match the socket directory exposed by the SPIRE Agent DaemonSet
          path: /run/spire/sockets
          type: Directory
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-client
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-client
  template:
    metadata:
      labels:
        app: echo-client
    spec:
      serviceAccountName: echo-client-sa
      containers:
      - name: echo-client
        # Replace with your actual image
        image: your-repo/echo-client:latest
        env:
        # go-spiffe discovers the Workload API through this variable.
        - name: SPIFFE_ENDPOINT_SOCKET
          value: "unix:///spire/agent.sock"
        - name: SERVER_ADDRESS
          value: "echo-server.my-app.svc:8443"
        - name: EXPECTED_SERVER_ID
          value: "spiffe://my-company.com/ns/my-app/sa/echo-server-sa"
        volumeMounts:
        - name: spire-agent-socket
          mountPath: /spire
          readOnly: true
      volumes:
      - name: spire-agent-socket
        hostPath:
          path: /run/spire/sockets
          type: Directory
```

    Go `echo-server` Implementation

    This server uses the go-spiffe library to fetch its SVID and trust bundle from the Workload API and configure an mTLS listener.

```go
// server/main.go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Create a Workload API source to fetch SVIDs and trust bundles.
	// The library finds the agent socket via the SPIFFE_ENDPOINT_SOCKET env var.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		log.Fatalf("Unable to create X509Source: %v", err)
	}
	defer source.Close()

	// Get our own SPIFFE ID to log it
	svid, err := source.GetX509SVID()
	if err != nil {
		log.Fatalf("Unable to get X509 SVID: %v", err)
	}
	log.Printf("Server SVID: %s", svid.ID)

	// Build an mTLS server config that requires and verifies a client SVID
	// and transparently picks up rotated SVIDs from the source.
	// AuthorizeAny() accepts any client with a valid SVID; tighten this with
	// tlsconfig.AuthorizeID or AuthorizeMemberOf in production.
	tlsConfig := tlsconfig.MTLSServerConfig(source, source, tlsconfig.AuthorizeAny())

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The client certificate was already validated during the handshake;
		// its SPIFFE ID is carried in the URI SAN of the peer certificate.
		clientID := r.TLS.PeerCertificates[0].URIs[0].String()
		log.Printf("Received request from %s", clientID)
		w.Write([]byte("Hello, " + clientID))
	})

	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: tlsConfig,
	}

	log.Println("Server listening on :8443...")
	// Certificates come from the TLS config, so the file arguments are empty.
	if err := server.ListenAndServeTLS("", ""); err != nil {
		log.Fatalf("Error serving: %v", err)
	}
}
```

    Go `echo-client` Implementation

    This client fetches its own SVID and uses it to dial the server, while also using the trust bundle to verify the server's identity.

```go
// client/main.go
package main

import (
	"context"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	serverAddress := os.Getenv("SERVER_ADDRESS")

	expectedServerID, err := spiffeid.FromString(os.Getenv("EXPECTED_SERVER_ID"))
	if err != nil {
		log.Fatalf("Invalid EXPECTED_SERVER_ID: %v", err)
	}

	// Create a Workload API source
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		log.Fatalf("Unable to create X509Source: %v", err)
	}
	defer source.Close()

	// Get our own SPIFFE ID to log it
	svid, err := source.GetX509SVID()
	if err != nil {
		log.Fatalf("Unable to get X509 SVID: %v", err)
	}
	log.Printf("Client SVID: %s", svid.ID)

	// Here's the critical part: verifying the certificate chain against the
	// trust bundle is not enough; we must also authorize the server's SPIFFE ID.
	// AuthorizeID makes the TLS handshake itself fail unless the server
	// presents exactly the expected ID.
	tlsConfig := tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeID(expectedServerID))
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: tlsConfig},
	}

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		resp, err := client.Get("https://" + serverAddress)
		if err != nil {
			// Includes authorization failures: a server with the wrong
			// SPIFFE ID never completes the handshake.
			log.Printf("Error making request: %v", err)
			continue
		}

		log.Printf("Successfully connected to %s", expectedServerID)
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		log.Printf("Server response: %s", string(body))
	}
}
```

    After building and pushing these images, deploying the microservices.yaml will result in the client pod successfully connecting to the server over mTLS, with both sides cryptographically verifying each other's identity.

    Advanced Patterns and Edge Cases

    Graceful SVID Rotation

    SVIDs are short-lived by default (e.g., a 1-hour TTL), and SPIRE automatically rotates them before they expire. The go-spiffe library handles this transparently: the workloadapi.X509Source maintains an internal cache of the latest SVID and trust bundle, receiving updates from the SPIRE Agent in the background. TLS configurations built from the source, such as those returned by tlsconfig.MTLSServerConfig and tlsconfig.MTLSClientConfig, automatically present the latest SVID on every new handshake. This ensures that long-running services don't fail when their certificates expire.

    However, applications with long-lived connections (like gRPC streams or WebSockets) need to be aware that certificates are only verified at handshake time: an established connection will outlive the SVID that authenticated it unless one of the peers enforces re-authentication. If your security posture requires active connections to be backed by fresh credentials, implement a periodic, graceful connection-cycling mechanism at the application layer so connections are re-established with current SVIDs.

    Inter-Cluster Federation

    What if your services span multiple Kubernetes clusters, or even different cloud providers? SPIFFE Federation is the answer. Federation allows two distinct SPIRE installations (in different trust domains) to trust each other.

    Let's say you have trust-domain-a and trust-domain-b.

1. **Establish Trust**: You configure a federation relationship on each SPIRE Server (with the spire-controller-manager, this is the ClusterFederatedTrustDomain CRD), pointing at the other trust domain's public bundle endpoint.
2. **Bundle Exchange**: The SPIRE Servers periodically fetch each other's trust bundles from those endpoints. Server A now has the CA certificates for trust-domain-b, and vice versa.
3. **Cross-Cluster Communication**: When a client in Cluster A with SVID spiffe://trust-domain-a/client wants to talk to a server in Cluster B with SVID spiffe://trust-domain-b/server, the client's X509Source includes the federated trust bundle (provided the client's registration entry lists trust-domain-b under federatesWith). When the server presents its SVID, the client validates it against that bundle.

This powerful feature enables secure multi-cloud architectures without complex VPNs or network-level peering, relying solely on cryptographic identity.
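Mechanically, federated verification means selecting the trust bundle keyed by the peer's trust domain before validating its certificate chain. Here is a simplified stdlib-only sketch of that lookup; trustDomainOf and bundleForID are invented helpers (real code would use go-spiffe's bundle sources):

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// trustDomainOf extracts the trust domain (the URI host) from a SPIFFE ID.
func trustDomainOf(spiffeID string) (string, error) {
	u, err := url.Parse(spiffeID)
	if err != nil {
		return "", err
	}
	if u.Scheme != "spiffe" || u.Host == "" {
		return "", errors.New("not a valid SPIFFE ID: " + spiffeID)
	}
	return u.Host, nil
}

// bundleForID picks the CA bundle to validate a peer SVID against.
// bundles maps trust domain -> PEM-encoded CA certificates.
func bundleForID(bundles map[string][]byte, peerID string) ([]byte, error) {
	td, err := trustDomainOf(peerID)
	if err != nil {
		return nil, err
	}
	b, ok := bundles[td]
	if !ok {
		return nil, errors.New("no trust bundle for domain " + td)
	}
	return b, nil
}

func main() {
	// After federation, the client holds bundles for both trust domains.
	bundles := map[string][]byte{
		"trust-domain-a": []byte("...CA certs for A..."),
		"trust-domain-b": []byte("...CA certs for B..."),
	}
	b, err := bundleForID(bundles, "spiffe://trust-domain-b/server")
	if err != nil {
		panic(err)
	}
	fmt.Printf("validating peer against %d-byte bundle\n", len(b))
}
```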

    Performance and Scalability Considerations

    * Workload API Overhead: The API is a local UNIX Domain Socket. Communication is extremely fast, with negligible latency. The overhead on the application is minimal.

    * SPIRE Agent Resource Usage: The agent is lightweight. Its primary tasks are watching the Kubelet API and proxying requests to the server. CPU and memory consumption are typically very low, making it suitable for even resource-constrained nodes.

    * SPIRE Server Scalability: The server is the central point of SVID issuance. For very large clusters (thousands of nodes, hundreds of thousands of pods), the server can become a bottleneck. Production best practices include:

    * Using a high-performance, external database (like PostgreSQL or MySQL) for the datastore instead of the default SQLite.

    * Scaling the SPIRE Server StatefulSet to multiple replicas for high availability; with a shared external datastore, all replicas can serve requests concurrently.

    * Monitoring the gRPC API latency and SVID signing rates.

    * SVID Caching: The SPIRE agent caches SVIDs. If the server is temporarily unavailable, the agent can continue to serve cached SVIDs to new and existing workloads on its node, providing a high degree of resilience.

    Conclusion: Decoupled Identity as a Security Primitive

    By implementing SPIFFE/SPIRE, we have established a robust, dynamic, and automated identity plane for our Kubernetes workloads. This approach offers several advantages over traditional methods:

    * Platform Agnostic: While we demonstrated this in Kubernetes, SPIRE can attest workloads on VMs, bare metal, or serverless platforms, providing a consistent identity story across your entire infrastructure.

    * Decoupled from Networking: Identity is not tied to an IP address. Pods can be rescheduled, change IPs, or move between clusters without breaking their cryptographic identity.

    * Reduced Operational Overhead: Compared to manual certificate management or configuring a full service mesh, SPIRE significantly reduces the burden on platform and security teams.

    Adopting a dedicated workload identity framework like SPIFFE/SPIRE is a fundamental step towards building true zero-trust systems. It shifts the security perimeter from the network to the application itself, providing a foundation for fine-grained authorization policies and verifiable, secure communication in any environment.
