KEDA + Prometheus: Advanced Autoscaling for Event-Driven K8s Workloads
The Inadequacy of Native HPA for Asynchronous Systems
As senior engineers, we've all encountered the fundamental limitations of Kubernetes' Horizontal Pod Autoscaler (HPA) when dealing with event-driven, asynchronous workloads. The standard HPA excels at scaling based on CPU and memory utilization, which are perfect proxies for synchronous, request-response services. When traffic increases, CPU load rises, and the HPA adds more pods. Simple, effective, and predictable.
However, this model collapses for workers processing jobs from a message queue (e.g., RabbitMQ, Kafka, SQS). Consider a video transcoding service. A worker pod might be consuming a single, computationally intensive job, running its CPU at 95% utilization. The HPA sees this high CPU and correctly maintains the current pod count or even scales up. But what if the RabbitMQ queue has a backlog of 10,000 pending videos? The HPA is completely blind to this business-critical metric. The system is falling behind, service level objectives (SLOs) are at risk, but from a pure resource-utilization perspective, everything looks fine.
Conversely, a fleet of 10 worker pods might be sitting idle with 1% CPU utilization because the queue is empty. The HPA will aggressively scale them down, which is desirable. But the moment 100 jobs are published to the queue, there's a significant lag while the HPA waits for the few remaining pods to pick up jobs, ramp up their CPU, cross the scaling threshold, and finally trigger a scale-up event. This latency is often unacceptable.
The core problem is a decoupling of workload pressure from resource metrics. For event-driven systems, the true measure of load is the backlog—the queue length, the consumer lag, or some other custom business metric. To build truly responsive and cost-efficient systems, we must scale based on this backlog. This is where KEDA (Kubernetes-based Event-Driven Autoscaling) becomes an indispensable tool in our arsenal.
This article is not an introduction to KEDA. We assume you understand its basic purpose. Instead, we will construct a production-ready, end-to-end solution for scaling a worker deployment based on a RabbitMQ queue length, exposed as a custom metric via Prometheus. We will dive deep into the implementation of the metric exporter, the nuances of the ScaledObject configuration, advanced stabilization patterns, and critical edge case handling.
Architectural Blueprint: The KEDA-Prometheus Trifecta
Our architecture consists of several key components working in concert to achieve backlog-aware scaling:
* Worker Deployment (video-processor): A Kubernetes Deployment of pods that consume jobs from a RabbitMQ queue.
* Custom Metric Exporter (rabbitmq-exporter): A small, dedicated service we will build. Its sole responsibility is to query the RabbitMQ Management API, retrieve the number of ready messages in a specific queue, and expose this value as a Prometheus metric.
* Prometheus: Scrapes the exporter and stores the queue-length time series that our scaling query runs against.
* KEDA (Operator and Metrics Adapter): Watches ScaledObject custom resources and acts as a Kubernetes metrics server, exposing the custom metric from Prometheus to the HPA.
* ScaledObject: A KEDA custom resource that defines the scaling rules. It links the target deployment (video-processor), the trigger (our Prometheus query), and the scaling parameters (min/max replicas, polling interval, etc.).

Here is a visual representation of the data flow for a scaling decision:
graph TD
subgraph Kubernetes Cluster
A[RabbitMQ Queue] -->|HTTP API Query| B(Custom rabbitmq-exporter)
B -->|Exposes /metrics endpoint| C{Prometheus}
C -->|Serves PromQL Query| D[KEDA Metrics Adapter]
D -->|Provides custom metric| E[Horizontal Pod Autoscaler]
E -->|Scales Deployment| F(video-processor Deployment)
F -->|Consumes jobs| A
end
G[KEDA Operator] -- Manages --> E
G -- Watches --> H(ScaledObject CRD)
This architecture effectively translates an external business metric (queue length) into a native Kubernetes scaling action, giving us the control we need.
Step 1: Implementing a Production-Grade Custom Metric Exporter
A robust exporter is the foundation of this pattern. A poorly implemented exporter can become a single point of failure, causing scaling to halt or behave erratically. We'll write a lightweight exporter in Go for its performance and small footprint.
Key requirements for our exporter:
* Connect to the RabbitMQ Management API.
* Handle connection errors and timeouts gracefully.
* Expose a /metrics endpoint in the Prometheus text format.
* Be configurable via environment variables for production flexibility.
* Be containerized for easy deployment to Kubernetes.
Here is the complete, runnable Go code for main.go:
// main.go
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"strconv"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Config holds configuration values for the exporter
type Config struct {
RabbitMQAPIURL string
RabbitMQUser string
RabbitMQPass string
QueueName string
VHost string
ListenAddr string
}
// loadConfig loads configuration from environment variables
func loadConfig() Config {
	// Defaults below are for local development; override via environment variables in production.
return Config{
RabbitMQAPIURL: getEnv("RABBITMQ_API_URL", "http://rabbitmq.default.svc.cluster.local:15672"),
RabbitMQUser: getEnv("RABBITMQ_USER", "guest"),
RabbitMQPass: getEnv("RABBITMQ_PASS", "guest"),
QueueName: getEnv("QUEUE_NAME", "video_processing"),
VHost: getEnv("VHOST", "/"),
ListenAddr: getEnv("LISTEN_ADDR", ":9091"),
}
}
func getEnv(key, fallback string) string {
if value, exists := os.LookupEnv(key); exists {
return value
}
return fallback
}
// RabbitMQCollector is our custom Prometheus collector
type RabbitMQCollector struct {
config Config
client *http.Client
messagesReadyDesc *prometheus.Desc
}
// NewRabbitMQCollector creates a new collector
func NewRabbitMQCollector(config Config) *RabbitMQCollector {
return &RabbitMQCollector{
config: config,
client: &http.Client{Timeout: 5 * time.Second},
messagesReadyDesc: prometheus.NewDesc(
"rabbitmq_queue_messages_ready",
"Number of messages ready to be delivered in the queue.",
[]string{"queue_name", "vhost"}, nil,
),
}
}
// Describe implements the prometheus.Collector interface
func (c *RabbitMQCollector) Describe(ch chan<- *prometheus.Desc) {
ch <- c.messagesReadyDesc
}
// Collect implements the prometheus.Collector interface
func (c *RabbitMQCollector) Collect(ch chan<- prometheus.Metric) {
	// URL-escape the vhost and queue name for the Management API path.
	// The default vhost "/" must be encoded as "%2F".
	endpoint := fmt.Sprintf("%s/api/queues/%s/%s",
		c.config.RabbitMQAPIURL,
		url.PathEscape(c.config.VHost),
		url.PathEscape(c.config.QueueName),
	)
	req, err := http.NewRequest("GET", endpoint, nil)
if err != nil {
log.Printf("Error creating request: %v", err)
return
}
req.SetBasicAuth(c.config.RabbitMQUser, c.config.RabbitMQPass)
resp, err := c.client.Do(req)
if err != nil {
log.Printf("Error querying RabbitMQ API: %v", err)
// Important: Do not send a metric if the scrape fails. A zero value is misleading.
// Prometheus will treat the absence of a metric as a stale series.
return
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
log.Printf("RabbitMQ API returned non-200 status: %d", resp.StatusCode)
return
}
var queueInfo struct {
MessagesReady float64 `json:"messages_ready"`
}
if err := json.NewDecoder(resp.Body).Decode(&queueInfo); err != nil {
log.Printf("Error decoding RabbitMQ API response: %v", err)
return
}
ch <- prometheus.MustNewConstMetric(
c.messagesReadyDesc,
prometheus.GaugeValue,
queueInfo.MessagesReady,
c.config.QueueName,
c.config.VHost,
)
}
func main() {
config := loadConfig()
collector := NewRabbitMQCollector(config)
prometheus.MustRegister(collector)
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
fmt.Fprintln(w, "OK")
})
log.Printf("Starting RabbitMQ metrics exporter on %s", config.ListenAddr)
if err := http.ListenAndServe(config.ListenAddr, nil); err != nil {
log.Fatalf("Failed to start server: %v", err)
}
}
Critical Implementation Detail: In the Collect function's error handling, we simply log the error and return. We do not push a metric with a value of 0. This is crucial. A value of 0 means the queue is empty, which would incorrectly trigger a scale-down. By not reporting the metric, Prometheus marks the time series as stale. We can configure our KEDA trigger and alerting to handle this stale data appropriately, which is a much safer failure mode.
Now, let's containerize it with a multi-stage Dockerfile for a lean production image:
# Dockerfile
# Stage 1: Build the binary
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o rabbitmq-exporter .
# Stage 2: Create the final, minimal image
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/rabbitmq-exporter .
EXPOSE 9091
CMD ["./rabbitmq-exporter"]
Finally, the Kubernetes manifests to deploy our exporter:
# rabbitmq-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: rabbitmq-exporter
labels:
app: rabbitmq-exporter
spec:
replicas: 1
selector:
matchLabels:
app: rabbitmq-exporter
template:
metadata:
labels:
app: rabbitmq-exporter
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9091'
spec:
containers:
- name: exporter
image: your-repo/rabbitmq-exporter:latest # Replace with your image
ports:
- containerPort: 9091
name: metrics
env:
- name: RABBITMQ_API_URL
value: "http://your-rabbitmq-service:15672"
- name: QUEUE_NAME
value: "video_processing"
# It's better to use secrets for credentials
- name: RABBITMQ_USER
valueFrom:
secretKeyRef:
name: rabbitmq-secret
key: username
- name: RABBITMQ_PASS
valueFrom:
secretKeyRef:
name: rabbitmq-secret
key: password
readinessProbe:
httpGet:
path: /healthz
port: 9091
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: rabbitmq-exporter
labels:
app: rabbitmq-exporter
spec:
selector:
app: rabbitmq-exporter
ports:
- name: metrics
port: 9091
targetPort: 9091
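The Deployment above pulls its credentials from a rabbitmq-secret that must already exist in the same namespace. A minimal sketch, using placeholder guest credentials that you should replace with your own:

apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-secret
type: Opaque
stringData:
  username: guest   # placeholder RabbitMQ user
  password: guest   # placeholder RabbitMQ password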
With the exporter deployed and the prometheus.io/scrape: 'true' annotation set, a Prometheus installation configured for annotation-based pod discovery will automatically find the pod and begin scraping our rabbitmq_queue_messages_ready metric.
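If your Prometheus is not already set up for that annotation-based discovery (the Prometheus Operator, for example, uses PodMonitor/ServiceMonitor resources instead), a scrape job along the following lines is the usual pattern. This is a sketch; the job name is arbitrary:

scrape_configs:
  - job_name: 'rabbitmq-exporter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__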
Step 2: The KEDA `ScaledObject` - A Deep Dive
The ScaledObject is where the magic happens. It's the declarative configuration that tells KEDA how to scale our video-processor deployment.
Let's assume we have a video-processor deployment already running. Here is the ScaledObject manifest with detailed explanations.
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: video-processor-scaler
namespace: default
spec:
scaleTargetRef:
name: video-processor # The name of the deployment to scale
# Polling interval for checking the metric. Trade-off between responsiveness and load.
pollingInterval: 15 # In seconds
# Cooldown period after the last trigger event before scaling down to minReplicas
cooldownPeriod: 120 # In seconds. Prevents flapping.
# The minimum and maximum number of replicas
minReplicaCount: 0 # Enable scale-to-zero for cost savings
maxReplicaCount: 50
# Advanced scaling behavior configuration
advanced:
    restoreToOriginalReplicaCount: false # On ScaledObject deletion, keep the current replica count rather than restoring the original
horizontalPodAutoscalerConfig:
# Control the HPA behavior directly for stabilization
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
# The core trigger configuration
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
# The PromQL query to get our metric value
query: |-
sum(rabbitmq_queue_messages_ready{queue_name="video_processing"})
# The target value for the metric. KEDA calculates desiredReplicas = ceil(metricValue / threshold)
threshold: '10'
      # Treat an empty query result as an error instead of 0.
      # This is critical for handling exporter/Prometheus downtime (see the edge cases below).
      ignoreNullValues: 'false'
Let's dissect the most critical sections:
* minReplicaCount: 0: This is a killer feature of KEDA. When the queue is empty and the trigger deactivates, the KEDA operator scales the video-processor deployment down to zero pods itself (the underlying HPA cannot manage a zero-replica target). This provides immense cost savings during idle periods. When the first message arrives and the trigger activates again, KEDA scales the deployment from 0 to 1 and the HPA takes over from there. The 0-to-1 boundary is governed separately from threshold; see the activationThreshold sketch after this list.
* threshold: '10': This is the heart of the scaling logic. The formula KEDA uses is desiredReplicas = ceil(metricValue / threshold). This means each pod is expected to handle a backlog of up to 10 messages.
* If queue length is 0, desiredReplicas is 0 (if minReplicaCount is 0).
* If queue length is 1-10, desiredReplicas is 1.
* If queue length is 99, desiredReplicas is ceil(99/10) = 10.
* If queue length is 100, desiredReplicas is ceil(100/10) = 10.
* If queue length is 101, desiredReplicas is ceil(101/10) = 11.
This provides a linear, predictable scaling response directly tied to the work backlog.
* advanced.horizontalPodAutoscalerConfig.behavior: This is an advanced feature that gives us fine-grained control over the underlying HPA's scaling velocity, preventing flapping.
* scaleDown.stabilizationWindowSeconds: 300: The HPA will look at the desired replica count over the last 5 minutes and use the highest recommendation from that window. This prevents scaling down immediately if the queue length briefly dips, only to spike again. It smooths out the scale-down behavior.
* scaleUp.stabilizationWindowSeconds: 0: We want to react to a growing backlog as quickly as possible, so we set the scale-up stabilization window to zero.
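The threshold only governs how many replicas to run once the workload is active; the 0-to-1 activation decision has its own knob. Recent KEDA versions expose an activationThreshold on the Prometheus trigger: the scaler activates only while the metric exceeds it (the default is 0). The snippet below is a sketch of how it would slot into our trigger; the value of 5 is illustrative and means the deployment stays at zero until more than 5 messages are ready:

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
        query: sum(rabbitmq_queue_messages_ready{queue_name="video_processing"})
        threshold: '10'            # 1-to-N scaling: roughly one replica per 10 ready messages
        activationThreshold: '5'   # 0-to-1 scaling: stay scaled to zero until more than 5 messages are ready
        ignoreNullValues: 'false'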
Production Patterns and Advanced Configurations
Pattern 1: Combining Backlog and Resource Metrics
What if a single video job is extremely CPU-intensive? A queue length of 5 might only scale us to 1 pod based on our threshold, but that one pod could be completely saturated at 100% CPU, unable to pick up more work. In this scenario, we need to scale on both queue length and CPU utilization.
KEDA supports multiple triggers. It will evaluate each trigger independently and choose the one that recommends the highest number of replicas.
# ... (rest of ScaledObject)
spec:
# ...
triggers:
- type: prometheus # Our queue length trigger
metadata:
serverAddress: http://prometheus-k8s.monitoring.svc.cluster.local:9090
query: sum(rabbitmq_queue_messages_ready{queue_name="video_processing"})
threshold: '10'
  - type: cpu # A standard CPU trigger
    metricType: Utilization
    metadata:
      value: '80' # Target 80% average CPU utilization
With this configuration:
* If the queue length is 50 (recommends 5 pods) and CPU is at 30%, KEDA will scale to 5 pods.
* If the queue length is 5 (recommends 1 pod) but that one pod's CPU is at 95%, the CPU trigger will recommend scaling up. KEDA will honor the higher recommendation and add more pods.
This hybrid approach provides the best of both worlds: proactive scaling based on backlog and reactive scaling based on immediate resource pressure. Note that, like the native HPA, the cpu trigger computes utilization against the containers' CPU requests, so the worker pods must declare them (see the sketch below).
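A sketch of what that looks like on the video-processor container; the image name and the 500m/1Gi figures are placeholders that should be sized from real profiling data:

    spec:
      containers:
        - name: video-processor
          image: your-repo/video-processor:latest # placeholder image
          resources:
            requests:
              cpu: 500m      # the cpu trigger's utilization is calculated against this request
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi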
Pattern 2: Authenticating with Prometheus
In a production environment, Prometheus is rarely exposed without authentication. KEDA handles this using a TriggerAuthentication resource.
First, create a secret with your credentials:
apiVersion: v1
kind: Secret
metadata:
name: prometheus-auth
type: Opaque
data:
# base64 encoded values
username: dXNlcg==
password: cGFzc3dvcmQ=
Next, create the TriggerAuthentication CRD to reference this secret:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: keda-prometheus-auth
spec:
secretTargetRef:
- parameter: username
name: prometheus-auth
key: username
- parameter: password
name: prometheus-auth
key: password
Finally, reference this TriggerAuthentication in your ScaledObject:
# ... (in ScaledObject triggers section)
  triggers:
    - type: prometheus
      metadata:
        # ... serverAddress, query, threshold as before
        authModes: "basic"      # tell the Prometheus scaler to use basic auth
      authenticationRef:        # sibling of metadata, not nested inside it
        name: keda-prometheus-auth
Edge Case Analysis: Preparing for Failure
A robust system is defined by how it handles failure. Let's analyze the critical failure modes.
Edge Case 1: Prometheus is Down or Unreachable
What happens when KEDA's metrics server cannot reach the Prometheus serverAddress?
The HPA controller will query KEDA for the metric. KEDA will attempt the PromQL query, which will time out or fail. KEDA's metrics server will return an error to the HPA. The HPA's behavior in this situation is to do nothing. It will not scale up or down; it will maintain the current number of replicas and log an error event.
Mitigation Strategy:
KEDA provides a fallback configuration ("fallback replicas"), which allows you to specify a number of replicas to scale to if the scaler is in an error state. This is a powerful feature for maintaining a baseline level of service during a metrics pipeline outage.
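The fallback block lives at the top level of the ScaledObject spec. A sketch, where the failure threshold of 3 and the 5 replicas are illustrative values rather than recommendations:

spec:
  # ... scaleTargetRef, triggers, etc. as before
  fallback:
    failureThreshold: 3   # after 3 consecutive failed metric reads...
    replicas: 5           # ...hold the deployment at 5 replicas until the scaler recovers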
Edge Case 2: The Custom Exporter Fails
If our rabbitmq-exporter pod crashes or becomes unresponsive, Prometheus will fail to scrape it. The rabbitmq_queue_messages_ready time series will become stale. PromQL queries for this metric will return an empty result set (no data).
In PromQL, sum() over an empty input returns no data at all; by default, however, KEDA's Prometheus scaler treats an empty query result as 0 (ignoreNullValues defaults to true). That is dangerous, as it would cause KEDA to scale the deployment down to minReplicaCount even while the queue is full. Setting ignoreNullValues: 'false', as we did in the ScaledObject above, turns the empty result into a scaler error instead.
Mitigation Strategy:
We can make our PromQL query more resilient using the absent() function or by checking the age of the metric.
A more robust query could be:
(sum(rabbitmq_queue_messages_ready{queue_name="video_processing"}) or on() vector(0)) unless on() (absent(rabbitmq_queue_messages_ready{queue_name="video_processing"}))
This is complex. A simpler and often better approach is to rely on alerting. Create an alert in Prometheus that fires if the exporter target is down (up{job="rabbitmq-exporter"} == 0) or if the metric has been absent for more than a few minutes (absent_over_time(rabbitmq_queue_messages_ready[5m])). This operationalizes the failure detection rather than over-complicating the scaling logic.
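A sketch of those two alerts as Prometheus alerting rules, assuming the exporter's scrape job carries the label job="rabbitmq-exporter"; adjust the label and durations to your environment:

groups:
  - name: rabbitmq-exporter-health
    rules:
      - alert: RabbitMQExporterDown
        expr: up{job="rabbitmq-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "rabbitmq-exporter target is down; KEDA is scaling blind."
      - alert: QueueMetricAbsent
        expr: absent_over_time(rabbitmq_queue_messages_ready{queue_name="video_processing"}[5m])
        labels:
          severity: critical
        annotations:
          summary: "rabbitmq_queue_messages_ready has been absent for 5 minutes."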
Edge Case 3: The "Poison Pill" Message
A message enters the queue that causes any worker consuming it to panic and crash. The pod enters a CrashLoopBackOff state. The message is re-queued by RabbitMQ and immediately picked up by another worker, which also crashes.
In this scenario:
- The queue length remains high or grows.
- The new pods also pick up the poison pill message and crash.
You are now paying for a large number of pods that are doing no useful work. This is an application-level problem, but our scaling mechanism is exacerbating it.
Mitigation Strategy:
This requires a multi-faceted defense:
* Instrument the workers: Expose application-level metrics such as jobs_processed_total and jobs_failed_total. A scaling decision should ideally consider the processing rate, not just the queue length. A more advanced custom KEDA scaler could be written to incorporate this logic: scale_if(queue_length > X AND processing_rate > Y).
* Alert on crash loops: An alert on CrashLoopBackOff for the worker deployment is a critical backstop to catch this scenario quickly (see the sketch below).
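A sketch of that backstop alert, assuming kube-state-metrics is installed and the worker pods inherit the video-processor name from their Deployment:

groups:
  - name: video-processor-health
    rules:
      - alert: VideoProcessorCrashLooping
        expr: |
          max by (pod) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", pod=~"video-processor-.*"}
          ) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "video-processor pods are crash-looping; a poison pill message may be stuck in the queue."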
Conclusion: Scaling on Intent, Not Consumption
By integrating KEDA with custom Prometheus metrics, we fundamentally shift our scaling paradigm from being reactive to resource consumption to being proactive based on work-to-be-done. This pattern allows us to build systems that are simultaneously more responsive to load, more resilient to transient spikes, and significantly more cost-effective by enabling true scale-to-zero during idle periods.
While the setup is more involved than a standard HPA, the operational benefits for any non-trivial, event-driven architecture are immense. It provides a level of control and observability that aligns infrastructure behavior directly with business logic, which is the hallmark of a mature, production-grade cloud-native system.