Production KEDA: Scaling Kubernetes on Custom Prometheus Metrics

15 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond Queue Length: True Performance-Based Scaling with KEDA

For senior engineers building event-driven architectures on Kubernetes, the standard Horizontal Pod Autoscaler (HPA) presents a fundamental limitation: in its default configuration, it scales on CPU and memory utilization, which are poor proxies for the health of a message-processing workload. A consumer service can have low CPU usage while a message queue backlog grows, leading to unacceptable end-to-end latency. KEDA (Kubernetes-based Event-Driven Autoscaling) elegantly solves this by scaling on external metrics like queue length, but that is often just the first step.

Production systems demand more nuance. What if message processing time is non-uniform? A queue length of 1,000 simple messages is vastly different from 1,000 complex, CPU-intensive messages. Scaling purely on queue length can lead to over-provisioning or, worse, an inability to meet Service Level Objectives (SLOs) for processing time.

This article bypasses introductory concepts and dives directly into a production-grade pattern: scaling a Kubernetes deployment based on a custom application performance metric exposed via Prometheus. We will implement a complete solution where KEDA scales a Go-based RabbitMQ consumer not by how many messages are waiting, but by the actual p95 processing latency of the messages it's currently handling. This approach ensures that resources are allocated based on the actual performance of the system, creating a truly responsive and efficient architecture.

We will cover:

  • Instrumenting a Go application to expose custom Prometheus histogram metrics for processing latency.
  • Architecting a KEDA ScaledObject that uses the prometheus scaler with a sophisticated PromQL query.
  • Implementing a multi-trigger strategy, combining queue length for backlog clearance and latency for performance tuning.
  • Analyzing advanced edge cases, such as scale-to-zero behavior with custom metrics and preventing metric feedback loops.

    The Scenario: A Latency-Sensitive Processing Pipeline

    Imagine a video processing pipeline. A producer service enqueues jobs to a RabbitMQ queue. Each job contains metadata for a video to be transcoded. A consumer service picks up these jobs, performs the transcoding, and signals completion. The complexity of transcoding can vary significantly based on video length and resolution.

    Our SLO is that the 95th percentile of job processing time must remain under 30 seconds. Scaling based on queue length is insufficient; a backlog of 100 short videos might be fine for one replica, while a backlog of 10 long videos could cripple it and violate our SLO.

    Here's our starting point: a basic consumer deployment and a RabbitMQ instance.

    The Consumer Application (Go)

    Our consumer will be written in Go. It will connect to RabbitMQ, process messages (simulated by a time.Sleep), and, crucially, export Prometheus metrics.

    main.go

    go
    package main
    
    import (
    	"log"
    	"math/rand"
    	"net/http"
    	"os"
    	"strconv"
    	"time"
    
    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    	"github.com/streadway/amqp"
    )
    
    var (
    	jobsProcessed = prometheus.NewCounter(
    		prometheus.CounterOpts{
    			Name: "worker_jobs_processed_total",
    			Help: "Total number of jobs processed.",
    		},
    	)
    
    	jobDuration = prometheus.NewHistogram(
    		prometheus.HistogramOpts{
    			Name:    "worker_job_duration_seconds",
    			Help:    "Histogram of job processing duration in seconds.",
    			Buckets: []float64{1, 5, 10, 15, 20, 25, 30, 45, 60},
    		},
    	)
    )
    
    func failOnError(err error, msg string) {
    	if err != nil {
    		log.Fatalf("%s: %s", msg, err)
    	}
    }
    
    func main() {
    	rand.Seed(time.Now().UnixNano())
    
    	// Register Prometheus metrics
    	prometheus.MustRegister(jobsProcessed)
    	prometheus.MustRegister(jobDuration)
    
    	// Get RabbitMQ connection string from environment
    	amqpURL := os.Getenv("AMQP_URL")
    	if amqpURL == "" {
    		amqpURL = "amqp://guest:guest@localhost:5672/"
    	}
    
    	conn, err := amqp.Dial(amqpURL)
    	failOnError(err, "Failed to connect to RabbitMQ")
    	defer conn.Close()
    
    	ch, err := conn.Channel()
    	failOnError(err, "Failed to open a channel")
    	defer ch.Close()
    
    	q, err := ch.QueueDeclare(
    		"task_queue", // name
    		true,         // durable
    		false,        // delete when unused
    		false,        // exclusive
    		false,        // no-wait
    		nil,          // arguments
    	)
    	failOnError(err, "Failed to declare a queue")
    
    	err = ch.Qos(1, 0, false) // Ensure we only prefetch one message at a time
    	failOnError(err, "Failed to set QoS")
    
    	msgs, err := ch.Consume(
    		q.Name, // queue
    		"",     // consumer
    		false,  // auto-ack
    		false,  // exclusive
    		false,  // no-local
    		false,  // no-wait
    		nil,    // args
    	)
    	failOnError(err, "Failed to register a consumer")
    
    	// Expose the /metrics endpoint
    	go func() {
    		log.Println("Starting metrics server on :8080")
    		http.Handle("/metrics", promhttp.Handler())
    		log.Fatal(http.ListenAndServe(":8080", nil))
    	}()
    
    	forever := make(chan bool)
    
    	go func() {
    		for d := range msgs {
    			log.Printf("Received a message: %s", d.Body)
    			timer := prometheus.NewTimer(jobDuration)
    
    			// Simulate variable work
    			workTime, err := strconv.Atoi(string(d.Body))
    			if err != nil {
    				workTime = rand.Intn(15) + 1 // Default work if body is not an int
    			}
    			log.Printf("Simulating work for %d seconds...", workTime)
    			time.Sleep(time.Duration(workTime) * time.Second)
    
    			timer.ObserveDuration() // Record the duration in the histogram
    			jobsProcessed.Inc()
    			log.Printf("Done")
    			d.Ack(false) // Acknowledge the message
    		}
    	}()
    
    	log.Printf(" [*] Waiting for messages. To exit press CTRL+C")
    	<-forever
    }

    Key aspects of this application:

    * jobDuration Histogram: This is the critical metric. A histogram allows us to calculate quantiles (like p95) on the Prometheus server.

    * /metrics Endpoint: The application exposes a standard Prometheus metrics endpoint on port 8080.

    * QoS Prefetch: ch.Qos(1, ...) is a vital production practice. It prevents a single greedy consumer pod from pulling hundreds of messages from the queue at once, starving other workers and making scaling ineffective.
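
    The Deployment below references a container image for this consumer. A minimal multi-stage build along these lines should work; treat it as a sketch, since the Go version, module layout, and base images are assumptions you may need to adjust.

    Dockerfile

    dockerfile
    # Build stage (Go version is an assumption; match your go.mod)
    FROM golang:1.21 AS build
    WORKDIR /src
    COPY go.mod go.sum ./
    RUN go mod download
    COPY . .
    RUN CGO_ENABLED=0 go build -o /consumer .

    # Small, static runtime image
    FROM gcr.io/distroless/static-debian12
    COPY --from=build /consumer /consumer
    EXPOSE 8080
    ENTRYPOINT ["/consumer"]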

    Kubernetes Deployment

    Here is the Deployment for our consumer. Note the annotations required for Prometheus to discover and scrape the /metrics endpoint.

    consumer-deployment.yaml

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: video-consumer
      labels:
        app: video-consumer
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: video-consumer
      template:
        metadata:
          labels:
            app: video-consumer
          annotations:
            prometheus.io/scrape: 'true'
            prometheus.io/path: '/metrics'
            prometheus.io/port: '8080'
        spec:
          containers:
          - name: consumer
            image: your-repo/keda-latency-consumer:latest # Replace with your image
            env:
            - name: AMQP_URL
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-conn
                  key: url
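
    The Deployment (and, later, KEDA's TriggerAuthentication) reads the AMQP connection string from a Secret named rabbitmq-conn. A minimal sketch is shown below; the credentials and host are placeholders that assume RabbitMQ is reachable as a rabbitmq Service in the default namespace.

    rabbitmq-secret.yaml

    yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: rabbitmq-conn
      namespace: default
    type: Opaque
    stringData:
      # Placeholder URL; replace user, password, and host with your own
      url: amqp://guest:guest@rabbitmq.default.svc.cluster.local:5672/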

    At this stage, you would have a running consumer, a RabbitMQ instance, and a Prometheus server scraping the consumer's metrics. One caveat: the prometheus.io/* annotations above rely on an annotation-based scrape config; the Prometheus Operator does not honor them by default, so with an Operator-based setup you would typically create a PodMonitor or ServiceMonitor selecting the app: video-consumer label instead. You can verify scraping by port-forwarding to a consumer pod and checking localhost:8080/metrics, or by querying Prometheus directly.
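
    A quick sanity check might look like this (a sketch; the Prometheus service name assumes a kube-prometheus installation in the monitoring namespace):

    bash
    # Inspect the consumer's metrics endpoint directly
    kubectl port-forward deploy/video-consumer 8080:8080 &
    curl -s localhost:8080/metrics | grep worker_job_duration_seconds

    # Confirm Prometheus has the series by querying its HTTP API
    kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
    curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
      'query=sum(rate(worker_job_duration_seconds_bucket[2m])) by (le)'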


    Step 1: Implementing the Latency-Based Scaler

    Our goal is to scale up the video-consumer deployment whenever the p95 processing latency exceeds our 30-second SLO. This requires a KEDA ScaledObject configured with a prometheus trigger.

    Crafting the PromQL Query

    The heart of this pattern is the PromQL query. We need to calculate the 95th percentile of the worker_job_duration_seconds histogram. The query for this is:

    promql
    histogram_quantile(0.95, sum(rate(worker_job_duration_seconds_bucket[2m])) by (le))

    Let's break this down:

    * worker_job_duration_seconds_bucket: This selects the raw time series for our histogram's buckets.

    * rate(...[2m]): This calculates the per-second increase of the values in the buckets over a 2-minute window. Using rate is crucial for histograms as their bucket counters are cumulative. It normalizes the data and makes it resilient to pod restarts.

    * sum(...) by (le): This aggregates the rates across all consumer pods, preserving the bucket dimension (le - less than or equal to).

    * histogram_quantile(0.95, ...): This is the final function that takes the aggregated, rated bucket data and calculates the 95th percentile value.

    The choice of the time window ([2m]) is critical. Too short, and the scaling will be overly twitchy and react to brief, insignificant spikes. Too long, and it will be too slow to respond to genuine performance degradation. A 2-5 minute window is often a good starting point.

    The KEDA `ScaledObject`

    Now we can define the ScaledObject that uses this query.

    keda-scaledobject.yaml

    yaml
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: video-consumer-scaler
      namespace: default
    spec:
      scaleTargetRef:
        name: video-consumer # Name of the deployment to scale
      minReplicaCount: 1
      maxReplicaCount: 20
      pollingInterval: 15 # How often KEDA queries Prometheus (in seconds)
      cooldownPeriod: 90  # Seconds to wait after the last active trigger before scaling to zero (only used when minReplicaCount is 0)
    
      triggers:
      - type: prometheus
        metadata:
          serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090 # Address of your Prometheus server
          metricName: p95_latency_seconds # A descriptive name for KEDA's internal metrics
          query: |
            histogram_quantile(0.95, sum(rate(worker_job_duration_seconds_bucket{app="video-consumer"}[2m])) by (le))
          threshold: '30' # Our SLO: 30 seconds

    Analysis of Configuration:

    * serverAddress: This must point to your Prometheus service inside the cluster. The example uses the standard address for a Prometheus Operator installation in the monitoring namespace.

    * query: We embed our carefully crafted PromQL query. Notice the addition of {app="video-consumer"}. This is a best practice to ensure the query is scoped only to the metrics from our target deployment, preventing interference from other applications.

    * threshold: This is the target value. KEDA's Prometheus scaler works on a simple principle: (current metric value / threshold) = desired replicas. If our p95 latency hits 60 seconds, KEDA will calculate 60 / 30 = 2 desired replicas. If it hits 90s, it will want 3 replicas, and so on. It's a direct, proportional scaling mechanism.

    * pollingInterval: A 15-second polling interval provides good responsiveness without overwhelming Prometheus. For less critical workloads, 30-60 seconds might be more appropriate.

    * cooldownPeriod: Despite the name, this only governs scaling to zero: after the last trigger reports inactive, KEDA waits this long before dropping the workload from 1 replica to 0. With minReplicaCount: 1 it has no effect here; scale-down flapping between non-zero replica counts is controlled by the HPA's scale-down stabilization window, which we configure in the next section.

    Once you apply this ScaledObject, KEDA will create an HPA that it manages. You can inspect it with kubectl get hpa keda-hpa-video-consumer-scaler. You'll see it's configured to use an external metric provided by KEDA's metrics adapter.
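
    Roughly (assuming the default namespace and the manifest filename used above):

    bash
    kubectl apply -f keda-scaledobject.yaml

    # KEDA names the managed HPA keda-hpa-<scaledobject-name>
    kubectl get hpa keda-hpa-video-consumer-scaler

    # Trigger health and scaling events appear in the ScaledObject's status and events
    kubectl describe scaledobject video-consumer-scaler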


    Step 2: Advanced Pattern - Multi-Trigger Scaling

    The latency-based scaler is powerful, but it has a blind spot. What if no messages are being processed? In that case, the worker_job_duration_seconds metric won't be updated, the PromQL query will return no data, and KEDA will not scale up. A sudden burst of 10,000 messages will sit in the queue until the single active pod processes one, its latency spikes, and then the scaling process begins. This is too slow.

    We need to solve for two conditions:

  • A large backlog exists: We need to scale up proactively to handle the volume, even if current latency is low.
  • Processing is slow: We need to scale up reactively if the current pods are struggling to meet the SLO, even with a small queue.

    This is a perfect use case for KEDA's multi-trigger capability. We'll combine our Prometheus scaler with a standard RabbitMQ queue-length scaler.

    keda-scaledobject-multi-trigger.yaml

    yaml
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: video-consumer-scaler
      namespace: default
    spec:
      scaleTargetRef:
        name: video-consumer
      minReplicaCount: 1
      maxReplicaCount: 20
      pollingInterval: 15
      cooldownPeriod: 90
      advanced:
        horizontalPodAutoscalerConfig:
          behavior:
            scaleDown:
              stabilizationWindowSeconds: 300
              policies:
              - type: Percent
                value: 100
                periodSeconds: 15
    
      triggers:
      # Trigger 1: Scale on p95 processing latency
      - type: prometheus
        metadata:
          serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
          metricName: p95_latency_seconds
          query: |
            histogram_quantile(0.95, sum(rate(worker_job_duration_seconds_bucket{app="video-consumer"}[2m])) by (le))
          threshold: '30'
    
      # Trigger 2: Scale on RabbitMQ queue length
      - type: rabbitmq
        metadata:
          queueName: task_queue
          mode: QueueLength
          value: '50' # Target 50 messages per pod
        authenticationRef:
          name: keda-rabbitmq-auth # Supplies the AMQP connection string via the 'host' parameter

    keda-trigger-auth.yaml

    yaml
    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: keda-rabbitmq-auth
      namespace: default
    spec:
      secretTargetRef:
        - parameter: host
          name: rabbitmq-conn # The secret holding the connection string
          key: url

    The Scaling Logic

    When multiple triggers are defined, KEDA evaluates each one independently during every polling interval. It calculates the desired number of replicas for each trigger and then uses the maximum value.

    Let's trace the behavior:

    Scenario A: High Volume, Fast Jobs. 1,000 messages arrive, each taking 5 seconds.

      * Prometheus trigger: p95 latency is 5s, well below the 30s threshold. It requests 1 replica (5/30, rounded up).
      * RabbitMQ trigger: Queue length is 1,000. It requests 20 replicas (1000/50).
      * KEDA's decision: The max is 20, so it scales the deployment to 20 pods to clear the backlog quickly.

    Scenario B: Low Volume, Slow Jobs. 10 messages arrive, but each is a complex job taking 90 seconds.

      * Prometheus trigger: p95 latency is now 90s. It requests 3 replicas (90/30).
      * RabbitMQ trigger: Queue length is 10. It requests 1 replica (10/50, rounded up).
      * KEDA's decision: The max is 3, so it scales up to 3 pods to bring processing latency back down towards the SLO.

    This composite approach provides a robust, two-tiered scaling strategy that handles both volume and performance degradation gracefully.

    Advanced HPA Behavior

    Note the advanced.horizontalPodAutoscalerConfig section. This is a powerful feature that allows you to directly configure the behavior of the underlying HPA that KEDA creates. Here, we've implemented a more conservative scale-down policy. The stabilizationWindowSeconds: 300 tells the HPA to look at the desired state over the last 5 minutes and use the highest recommendation from that period. This prevents it from aggressively scaling down the moment a performance metric or queue length briefly dips, making the system more stable.
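
    If scale-up bursts also need damping (for example, to protect a downstream dependency), the same block accepts a scaleUp section. The values below are illustrative assumptions, not recommendations:

    yaml
    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleUp:
            stabilizationWindowSeconds: 0  # React to scale-up signals immediately
            policies:
            - type: Pods
              value: 4           # Add at most 4 pods...
              periodSeconds: 60  # ...per 60-second period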


    Production Considerations and Edge Cases

    Handling Scale-to-Zero with Custom Metrics

    KEDA's flagship feature is scaling deployments down to zero replicas. With a queue length scaler, this is straightforward: if the queue is empty, scale to zero. With a custom Prometheus metric, it's more complex.

    If we set minReplicaCount: 0 in our ScaledObject, what happens when the last pod scales down?

  • The video-consumer deployment has 0 replicas.
  • No metrics are being emitted for worker_job_duration_seconds.
  • The PromQL query histogram_quantile(...) will return an empty result.
  • By default, the Prometheus scaler treats an empty query result as a metric value of 0 rather than an error (its ignoreNullValues option defaults to true), so the latency trigger simply goes quiet.

    This is the desired behavior. When new messages arrive, the RabbitMQ trigger (which the KEDA operator continues to evaluate) will detect a non-zero queue length and scale the deployment up to 1. The first pod will start processing, emit latency metrics, and the Prometheus scaler will take over if needed.

    This highlights why the multi-trigger approach is so critical for a scale-to-zero architecture. The queue length scaler acts as the "igniter" to wake up the system, while the performance scaler acts as the "throttle" to ensure SLOs are met once it's running.
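
    A sketch of the scale-to-zero variant, changing only the replica bounds and cooldown on the ScaledObject above:

    yaml
    spec:
      scaleTargetRef:
        name: video-consumer
      minReplicaCount: 0   # Allow KEDA to scale the deployment to zero when idle
      maxReplicaCount: 20
      cooldownPeriod: 300  # Wait 5 minutes after the last active trigger before dropping to 0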

    The `activationThreshold` Edge Case

    For some scalers, you might want to scale to zero but only wake up if a significant backlog appears. KEDA supports activation values for this (activationThreshold on the Prometheus scaler, activationValue on the RabbitMQ scaler). For the RabbitMQ scaler, this would make KEDA scale from 0 to 1 only when, say, 10 messages are in the queue, preventing scale-up on trivial or stray messages; a sketch follows.
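
    As a sketch (assuming KEDA v2.8 or newer, where per-scaler activation values were introduced), the RabbitMQ trigger gains a single line:

    yaml
    - type: rabbitmq
      metadata:
        queueName: task_queue
        mode: QueueLength
        value: '50'
        activationValue: '10' # Stay at zero replicas until at least 10 messages are waiting
      authenticationRef:
        name: keda-rabbitmq-auth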

    For a Prometheus scaler, this is less common but could be used to prevent waking the system for a fleeting metric blip. However, it's generally more reliable to rely on a queue-based trigger for the initial activation from zero.

    Metric Feedback Loops

    Be mindful of metrics whose control loop can oscillate. For example, if you scale on average CPU utilization with a tight threshold, adding pods lowers average CPU, which triggers a scale-down, which raises CPU again and triggers another scale-up. Without tolerances and stabilization windows, this leads to flapping.

    Our latency metric is generally safe from this. Adding more pods directly reduces the pressure on existing pods, lowering overall latency and creating a stable control loop. However, always analyze your chosen metric: does adding capacity reliably and predictably improve this metric? If not, it may be a poor candidate for autoscaling.

    Performance Tuning the KEDA Operator

    In a large cluster with hundreds of ScaledObjects, the KEDA operator itself can become a bottleneck. The key tuning parameters are:

    * pollingInterval: As discussed, this is a trade-off between responsiveness and load on the external system (Prometheus, RabbitMQ). Don't set it unnecessarily low.

    * KEDA Operator Resources: Monitor the CPU and memory usage of the keda-operator and keda-metrics-apiserver deployments. Allocate resources appropriately based on the number of ScaledObjects in your cluster.

    * Prometheus Performance: Your scaling is only as reliable as your monitoring system. Ensure your Prometheus instance is properly provisioned and can handle the query load from KEDA and other sources without significant delay. A slow Prometheus query response will directly translate to slow scaling decisions.
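
    One common mitigation, sketched below under the assumption that you run the Prometheus Operator, is to precompute the expensive quantile with a recording rule and point KEDA's trigger at the cheaper, pre-aggregated series (the rule and series names here are illustrative):

    prometheus-recording-rule.yaml

    yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: video-consumer-latency
      namespace: monitoring
    spec:
      groups:
      - name: video-consumer.rules
        interval: 30s
        rules:
        - record: job:worker_job_duration_seconds:p95_2m
          expr: histogram_quantile(0.95, sum(rate(worker_job_duration_seconds_bucket{app="video-consumer"}[2m])) by (le))

    The trigger's query then becomes a simple lookup of job:worker_job_duration_seconds:p95_2m at each polling interval.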


    Conclusion: From Reactive to Proactive Scaling

    By graduating from basic CPU/memory scaling to event-driven triggers, we build more efficient systems. By evolving further from simple queue-length metrics to custom, application-aware performance indicators like p95 latency, we build truly intelligent, resilient, and SLO-driven platforms.

    The pattern detailed here—instrumenting an application with specific performance metrics and using a multi-trigger KEDA ScaledObject—represents a mature approach to autoscaling in Kubernetes. It directly ties resource allocation to business-level objectives (e.g., "process 95% of jobs in under 30 seconds"), not just technical indicators.

    This level of control, combining a backlog-aware scaler for capacity and a performance-aware scaler for speed, allows senior engineers to design systems that are both cost-effective during idle periods and powerfully responsive under load. It moves the conversation from "how many pods do we need?" to "what performance do we need to guarantee, and how do we automate the system to achieve it?"
