KEDA + Prometheus: Advanced Autoscaling for Event-Driven K8s Workloads

October 9, 2025

18 min read

Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inadequacy of Native HPA for Asynchronous Systems

As senior engineers, we've all encountered the fundamental limitations of Kubernetes' Horizontal Pod Autoscaler (HPA) when dealing with event-driven, asynchronous workloads. The standard HPA excels at scaling based on CPU and memory utilization, which are perfect proxies for synchronous, request-response services. When traffic increases, CPU load rises, and the HPA adds more pods. Simple, effective, and predictable.

However, this model collapses for workers processing jobs from a message queue (e.g., RabbitMQ, Kafka, SQS). Consider a video transcoding service. A worker pod might be consuming a single, computationally intensive job, running its CPU at 95% utilization. The HPA sees this high CPU and correctly maintains the current pod count or even scales up. But what if the RabbitMQ queue has a backlog of 10,000 pending videos? The HPA is completely blind to this business-critical metric. The system is falling behind, service level objectives (SLOs) are at risk, but from a pure resource-utilization perspective, everything looks fine.

Conversely, a fleet of 10 worker pods might be sitting idle with 1% CPU utilization because the queue is empty. The HPA will aggressively scale them down, which is desirable. But the moment 100 jobs are published to the queue, there's a significant lag while the HPA waits for the few remaining pods to pick up jobs, ramp up their CPU, cross the scaling threshold, and finally trigger a scale-up event. This latency is often unacceptable.

The core problem is a decoupling of workload pressure from resource metrics. For event-driven systems, the true measure of load is the backlog—the queue length, the consumer lag, or some other custom business metric. To build truly responsive and cost-efficient systems, we must scale based on this backlog. This is where KEDA (Kubernetes-based Event-Driven Autoscaling) becomes an indispensable tool in our arsenal.

This article is not an introduction to KEDA. We assume you understand its basic purpose. Instead, we will construct a production-ready, end-to-end solution for scaling a worker deployment based on a RabbitMQ queue length, exposed as a custom metric via Prometheus. We will dive deep into the implementation of the metric exporter, the nuances of the ScaledObject configuration, advanced stabilization patterns, and critical edge case handling.

Architectural Blueprint: The KEDA-Prometheus Trifecta

Our architecture consists of several key components working in concert to achieve backlog-aware scaling:

The Workload (video-processor): A Kubernetes Deployment of pods that consume jobs from a RabbitMQ queue.

The Message Queue (RabbitMQ): The source of truth for our work backlog.

The Custom Exporter (rabbitmq-exporter): A small, dedicated service we will build. Its sole responsibility is to query the RabbitMQ Management API, retrieve the number of ready messages in a specific queue, and expose this value as a Prometheus metric.

Prometheus: Our time-series database and monitoring system. It will be configured to scrape the custom exporter, collecting the queue length metric over time.

KEDA Operator: The control plane component that watches for ScaledObject custom resources. It acts as a Kubernetes metrics server, exposing the custom metric from Prometheus to the HPA.

The ScaledObject: A KEDA custom resource that defines the scaling rules. It links the target deployment (video-processor), the trigger (our Prometheus query), and the scaling parameters (min/max replicas, polling interval, etc.).

Here is a visual representation of the data flow for a scaling decision:

mermaid

graph TD
    subgraph Kubernetes Cluster
        A[RabbitMQ Queue] -->|HTTP API Query| B(Custom rabbitmq-exporter)
        B -->|Exposes /metrics endpoint| C{Prometheus}
        C -->|Serves PromQL Query| D[KEDA Metrics Adapter]
        D -->|Provides custom metric| E[Horizontal Pod Autoscaler]
        E -->|Scales Deployment| F(video-processor Deployment)
        F -->|Consumes jobs| A
    end

    G[KEDA Operator] -- Manages --> E
    G -- Watches --> H(ScaledObject CRD)

This architecture effectively translates an external business metric (queue length) into a native Kubernetes scaling action, giving us the control we need.

Step 1: Implementing a Production-Grade Custom Metric Exporter

A robust exporter is the foundation of this pattern. A poorly implemented exporter can become a single point of failure, causing scaling to halt or behave erratically. We'll write a lightweight exporter in Go for its performance and small footprint.

Key requirements for our exporter:

* Connect to the RabbitMQ Management API.

* Handle connection errors and timeouts gracefully.

* Expose a /metrics endpoint in the Prometheus text format.

* Be configurable via environment variables for production flexibility.

* Be containerized for easy deployment to Kubernetes.

Here is the complete, runnable Go code for main.go:

// main.go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Config holds configuration values for the exporter
type Config struct {
	RabbitMQAPIURL string
	RabbitMQUser     string
	RabbitMQPass     string
	QueueName        string
	VHost            string
	ListenAddr       string
}

// loadConfig loads configuration from environment variables
func loadConfig() Config {
	// Use os.Getenv for real implementation, hardcoding for example clarity
	return Config{
		RabbitMQAPIURL: getEnv("RABBITMQ_API_URL", "http://rabbitmq.default.svc.cluster.local:15672"),
		RabbitMQUser:   getEnv("RABBITMQ_USER", "guest"),
		RabbitMQPass:   getEnv("RABBITMQ_PASS", "guest"),
		QueueName:      getEnv("QUEUE_NAME", "video_processing"),
		VHost:          getEnv("VHOST", "/"),
		ListenAddr:     getEnv("LISTEN_ADDR", ":9091"),
	}
}

func getEnv(key, fallback string) string {
	if value, exists := os.LookupEnv(key); exists {
		return value
	}
	return fallback
}

// RabbitMQCollector is our custom Prometheus collector
type RabbitMQCollector struct {
	config            Config
	client            *http.Client
	messagesReadyDesc *prometheus.Desc
}

// NewRabbitMQCollector creates a new collector
func NewRabbitMQCollector(config Config) *RabbitMQCollector {
	return &RabbitMQCollector{
		config: config,
		client: &http.Client{Timeout: 5 * time.Second},
		messagesReadyDesc: prometheus.NewDesc(
			"rabbitmq_queue_messages_ready",
			"Number of messages ready to be delivered in the queue.",
			[]string{"queue_name", "vhost"}, nil,
		),
	}
}

// Describe implements the prometheus.Collector interface
func (c *RabbitMQCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.messagesReadyDesc
}

// Collect implements the prometheus.Collector interface
func (c *RabbitMQCollector) Collect(ch chan<- prometheus.Metric) {
	// The VHost '/' needs to be encoded as '%2f'
	vhost := c.config.VHost
	if vhost == "/" {
		vhost = "%2f"
	}
	url := fmt.Sprintf("%s/api/queues/%s/%s", c.config.RabbitMQAPIURL, vhost, c.config.QueueName)

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Printf("Error creating request: %v", err)
		return
	}
	req.SetBasicAuth(c.config.RabbitMQUser, c.config.RabbitMQPass)

	resp, err := c.client.Do(req)
	if err != nil {
		log.Printf("Error querying RabbitMQ API: %v", err)
		// Important: Do not send a metric if the scrape fails. A zero value is misleading.
		// Prometheus will treat the absence of a metric as a stale series.
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("RabbitMQ API returned non-200 status: %d", resp.StatusCode)
		return
	}

	var queueInfo struct {
		MessagesReady float64 `json:"messages_ready"`
	}

	if err := json.NewDecoder(resp.Body).Decode(&queueInfo); err != nil {
		log.Printf("Error decoding RabbitMQ API response: %v", err)
		return
	}

	ch <- prometheus.MustNewConstMetric(
		c.messagesReadyDesc,
		prometheus.GaugeValue,
		queueInfo.MessagesReady,
		c.config.QueueName,
		c.config.VHost,
	)
}

func main() {
	config := loadConfig()

	collector := NewRabbitMQCollector(config)
	prometheus.MustRegister(collector)

	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "OK")
	})

	log.Printf("Starting RabbitMQ metrics exporter on %s", config.ListenAddr)
	if err := http.ListenAndServe(config.ListenAddr, nil); err != nil {
		log.Fatalf("Failed to start server: %v", err)
	}
}

Critical Implementation Detail: In the Collect function's error handling, we simply log the error and return. We do not push a metric with a value of 0. This is crucial. A value of 0 means the queue is empty, which would incorrectly trigger a scale-down. By not reporting the metric, Prometheus marks the time series as stale. We can configure our KEDA trigger and alerting to handle this stale data appropriately, which is a much safer failure mode.

Now, let's containerize it with a multi-stage Dockerfile for a lean production image:

dockerfile

# Dockerfile
# Stage 1: Build the binary
FROM golang:1.21-alpine AS builder

WORKDIR /app

COPY go.mod go.sum ./
RUN go mod download

COPY . .

RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o rabbitmq-exporter .

# Stage 2: Create the final, minimal image
FROM alpine:latest

RUN apk --no-cache add ca-certificates

WORKDIR /root/

COPY --from=builder /app/rabbitmq-exporter .

EXPOSE 9091

CMD ["./rabbitmq-exporter"]

Finally, the Kubernetes manifests to deploy our exporter:

yaml

# rabbitmq-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq-exporter
  labels:
    app: rabbitmq-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rabbitmq-exporter
  template:
    metadata:
      labels:
        app: rabbitmq-exporter
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9091'
    spec:
      containers:
      - name: exporter
        image: your-repo/rabbitmq-exporter:latest # Replace with your image
        ports:
        - containerPort: 9091
          name: metrics
        env:
        - name: RABBITMQ_API_URL
          value: "http://your-rabbitmq-service:15672"
        - name: QUEUE_NAME
          value: "video_processing"
        # It's better to use secrets for credentials
        - name: RABBITMQ_USER
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: username
        - name: RABBITMQ_PASS
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: password
        readinessProbe:
          httpGet:
            path: /healthz
            port: 9091
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-exporter
  labels:
    app: rabbitmq-exporter
spec:
  selector:
    app: rabbitmq-exporter
  ports:
  - name: metrics
    port: 9091
    targetPort: 9091

With the exporter deployed and the prometheus.io/scrape: 'true' annotation set, a standard Prometheus installation will automatically discover and begin scraping our rabbitmq_queue_messages_ready metric.

Step 2: The KEDA `ScaledObject` - A Deep Dive

The ScaledObject is where the magic happens. It's the declarative configuration that tells KEDA how to scale our video-processor deployment.

Let's assume we have a video-processor deployment already running. Here is the ScaledObject manifest with detailed explanations.

yaml

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-processor-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: video-processor # The name of the deployment to scale
  
  # Polling interval for checking the metric. Trade-off between responsiveness and load.
  pollingInterval: 15 # In seconds

  # Cooldown period after the last trigger event before scaling down to minReplicas
  cooldownPeriod: 120 # In seconds. Prevents flapping.

  # The minimum and maximum number of replicas
  minReplicaCount: 0 # Enable scale-to-zero for cost savings
  maxReplicaCount: 50

  # Advanced scaling behavior configuration
  advanced:
    restoreToOriginalReplicaCount: false # Don't scale up to original count on KEDA activation
    horizontalPodAutoscalerConfig:
      # Control the HPA behavior directly for stabilization
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 100
            periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 0 # Scale up immediately

  # The core trigger configuration
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
      # The PromQL query to get our metric value
      query: |-
        sum(rabbitmq_queue_messages_ready{queue_name="video_processing"})
      # The target value for the metric. KEDA calculates desiredReplicas = ceil(metricValue / threshold)
      threshold: '10'
      # What to do if Prometheus returns no data for the query
      # This is critical for handling exporter/Prometheus downtime
      queryStep: '1' # Optional, defaults to '1'

Let's dissect the most critical sections:

* minReplicaCount: 0: This is a killer feature of KEDA. When the queue length is zero, KEDA will remove the HPA object entirely, allowing the video-processor deployment to scale down to zero pods. This provides immense cost savings during idle periods. When the first message arrives and the metric becomes > 0, KEDA recreates the HPA and scales the deployment to 1 pod.

* threshold: '10': This is the heart of the scaling logic. The formula KEDA uses is desiredReplicas = ceil(metricValue / threshold). This means each pod is expected to handle a backlog of up to 10 messages.

* If queue length is 0, desiredReplicas is 0 (if minReplicaCount is 0).

* If queue length is 1-10, desiredReplicas is 1.

* If queue length is 99, desiredReplicas is ceil(99/10) = 10.

* If queue length is 100, desiredReplicas is ceil(100/10) = 10.

* If queue length is 101, desiredReplicas is ceil(101/10) = 11.

This provides a linear, predictable scaling response directly tied to the work backlog.

* advanced.horizontalPodAutoscalerConfig.behavior: This is an advanced feature that gives us fine-grained control over the underlying HPA's scaling velocity, preventing flapping.

* scaleDown.stabilizationWindowSeconds: 300: The HPA will look at the desired replica count over the last 5 minutes and use the highest recommendation from that window. This prevents scaling down immediately if the queue length briefly dips, only to spike again. It smooths out the scale-down behavior.

* scaleUp.stabilizationWindowSeconds: 0: We want to react to a growing backlog as quickly as possible, so we set the scale-up stabilization window to zero.

Production Patterns and Advanced Configurations

Pattern 1: Combining Backlog and Resource Metrics

What if a single video job is extremely CPU-intensive? A queue length of 5 might only scale us to 1 pod based on our threshold, but that one pod could be completely saturated at 100% CPU, unable to pick up more work. In this scenario, we need to scale on both queue length and CPU utilization.

KEDA supports multiple triggers. It will evaluate each trigger independently and choose the one that recommends the highest number of replicas.

yaml

# ... (rest of ScaledObject)
spec:
  # ...
  triggers:
  - type: prometheus # Our queue length trigger
    metadata:
      serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
      query: sum(rabbitmq_queue_messages_ready{queue_name="video_processing"})
      threshold: '10'
  
  - type: cpu # A standard CPU trigger
    metadata:
      value: '80' # Target 80% CPU utilization

With this configuration:

* If the queue length is 50 (recommends 5 pods) and CPU is at 30%, KEDA will scale to 5 pods.

* If the queue length is 5 (recommends 1 pod) but that one pod's CPU is at 95%, the CPU trigger will recommend scaling up. KEDA will honor the higher recommendation and add more pods.

This hybrid approach provides the best of both worlds: proactive scaling based on backlog and reactive scaling based on immediate resource pressure.

Pattern 2: Authenticating with Prometheus

In a production environment, Prometheus is rarely exposed without authentication. KEDA handles this using a TriggerAuthentication resource.

First, create a secret with your credentials:

yaml

apiVersion: v1
kind: Secret
metadata:
  name: prometheus-auth
type: Opaque
data:
  # base64 encoded values
  username: dXNlcg==
  password: cGFzc3dvcmQ=

Next, create the TriggerAuthentication CRD to reference this secret:

yaml

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-prometheus-auth
spec:
  secretTargetRef:
    - parameter: username
      name: prometheus-auth
      key: username
    - parameter: password
      name: prometheus-auth
      key: password

Finally, reference this TriggerAuthentication in your ScaledObject:

yaml

# ... (in ScaledObject triggers section)
  triggers:
  - type: prometheus
    metadata:
      # ...
    authenticationRef:
      name: keda-prometheus-auth

Edge Case Analysis: Preparing for Failure

A robust system is defined by how it handles failure. Let's analyze the critical failure modes.

Edge Case 1: Prometheus is Down or Unreachable

What happens when KEDA's metrics server cannot reach the Prometheus serverAddress?

The HPA controller will query KEDA for the metric. KEDA will attempt the PromQL query, which will time out or fail. KEDA's metrics server will return an error to the HPA. The HPA's behavior in this situation is to do nothing. It will not scale up or down; it will maintain the current number of replicas and log an error event.

Mitigation Strategy:

Alerting: This is paramount. You must have a separate monitoring system (e.g., Alertmanager) fire a critical alert if KEDA cannot scrape its metric source. An alert on the KEDA operator logs or on HPA error events is essential.

Fallback Replicas (KEDA v2.10+): Newer versions of KEDA are introducing the concept of fallback replicas, which allows you to specify a number of replicas to scale to if the scaler is in an error state. This is a powerful feature for maintaining a baseline level of service during a metrics pipeline outage.

Edge Case 2: The Custom Exporter Fails

If our rabbitmq-exporter pod crashes or becomes unresponsive, Prometheus will fail to scrape it. The rabbitmq_queue_messages_ready time series will become stale. PromQL queries for this metric will return an empty result set (no data).

By default, sum() on an empty set returns a value of 0. This is dangerous, as it would cause KEDA to scale the deployment down to minReplicaCount, even if the queue is full.

Mitigation Strategy:

We can make our PromQL query more resilient using the absent() function or by checking the age of the metric.

A more robust query could be:

promql

(sum(rabbitmq_queue_messages_ready{queue_name="video_processing"}) or on() vector(0)) unless (absent(rabbitmq_queue_messages_ready{queue_name="video_processing"}))

This is complex. A simpler and often better approach is to rely on alerting. Create an alert in Prometheus that fires if the exporter target is down (up{job="rabbitmq-exporter"} == 0) or if the metric has been absent for more than a few minutes (absent_over_time(rabbitmq_queue_messages_ready[5m])). This operationalizes the failure detection rather than over-complicating the scaling logic.

Edge Case 3: The "Poison Pill" Message

A message enters the queue that causes any worker consuming it to panic and crash. The pod enters a CrashLoopBackOff state. The message is re-queued by RabbitMQ and immediately picked up by another worker, which also crashes.

In this scenario:

The queue length remains high or grows.

KEDA sees the high backlog and continues to scale up the deployment.

The new pods also pick up the poison pill message and crash.

You are now paying for a large number of pods that are doing no useful work. This is an application-level problem, but our scaling mechanism is exacerbating it.

Mitigation Strategy:

This requires a multi-faceted defense:

Dead-Letter Queues (DLQ): Configure your RabbitMQ queues with a DLQ. After a message is re-queued a certain number of times, it should be moved to a separate queue for manual inspection. This isolates the poison pill.

Application Metrics: Your workers should expose metrics like jobs_processed_total and jobs_failed_total. A scaling decision should ideally consider the processing rate, not just the queue length. A more advanced custom KEDA scaler could be written to incorporate this logic: scale_if(queue_length > X AND processing_rate > Y).

Alerting: An alert on a high rate of CrashLoopBackOff for the worker deployment is a critical backstop to catch this scenario quickly.

Conclusion: Scaling on Intent, Not Consumption

By integrating KEDA with custom Prometheus metrics, we fundamentally shift our scaling paradigm from being reactive to resource consumption to being proactive based on work-to-be-done. This pattern allows us to build systems that are simultaneously more responsive to load, more resilient to transient spikes, and significantly more cost-effective by enabling true scale-to-zero during idle periods.

While the setup is more involved than a standard HPA, the operational benefits for any non-trivial, event-driven architecture are immense. It provides a level of control and observability that aligns infrastructure behavior directly with business logic, which is the hallmark of a mature, production-grade cloud-native system.

The Inadequacy of Native HPA for Asynchronous Systems

Architectural Blueprint: The KEDA-Prometheus Trifecta

Step 1: Implementing a Production-Grade Custom Metric Exporter

Step 2: The KEDA `ScaledObject` - A Deep Dive

Production Patterns and Advanced Configurations

Pattern 1: Combining Backlog and Resource Metrics

Pattern 2: Authenticating with Prometheus

Edge Case Analysis: Preparing for Failure

Edge Case 1: Prometheus is Down or Unreachable

Edge Case 2: The Custom Exporter Fails

Edge Case 3: The "Poison Pill" Message

Conclusion: Scaling on Intent, Not Consumption

Found this article helpful?