Fine-Tuning Kubernetes HPA with Prometheus Custom Metrics
The Scaling Fallacy: Why CPU/Memory Fails Stateful Workloads
In the world of stateless microservices, scaling based on CPU and memory utilization is often a reasonable starting point. A web server under heavy load will typically exhibit high CPU usage, making it a reliable proxy for scaling demand. However, this model collapses when applied to the nuanced world of stateful workloads. Consider these common scenarios:
* Message Queues (e.g., RabbitMQ, Kafka): A consumer group's primary scaling indicator isn't its CPU usage, but the consumer lag or the number of ready messages in a queue. A consumer pod can be idle (low CPU) while a massive backlog of messages accumulates, demanding a scale-up event.
* Databases (e.g., PostgreSQL, Cassandra): Scaling a read-replica pool might depend on the number of active connections, replication lag, or the depth of a transaction queue—metrics that have a weak correlation with raw CPU percentage.
* Batch Processing Jobs: A fleet of workers processing a job queue should scale based on the number of pending tasks, not the resource consumption of the currently running workers.
Relying on generic resource metrics for these systems leads to inefficient, unresponsive, and often incorrect scaling behavior. The solution is to scale on metrics that directly represent the application's workload and health. This is where the Kubernetes custom metrics pipeline, powered by Prometheus, becomes an indispensable tool for senior engineers.
This article will walk through a production-grade implementation of scaling a StatefulSet of message queue consumers based on queue depth, tackling the advanced challenges and edge cases you'll encounter in a real-world environment.
The Custom Metrics Pipeline: A High-Fidelity Architecture
Before diving into YAML, it's critical to understand the flow of information that enables this advanced scaling. A misconfiguration in any single component can lead to a complete failure of the autoscaling loop.
graph TD
A[Stateful Application e.g., RabbitMQ] -->|Exposes /metrics| B(Prometheus Exporter);
B -->|Scraped by| C(Prometheus Server);
C -->|Queried by| D(Prometheus Adapter);
D -->|Serves metrics via| E(Kubernetes Custom Metrics API);
E -->|Read by| F(HPA Controller);
F -->|Adjusts .spec.replicas| G(Deployment / StatefulSet);
style A fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#f9f,stroke:#333,stroke-width:2px
style F fill:#f9f,stroke:#333,stroke-width:2px
* The Exporter: a component such as rabbitmq_exporter, which queries the application's management API and translates the data into the Prometheus exposition format.
* The Prometheus Server: scrapes the exporter's /metrics endpoint, ingesting and storing the time-series data.
* The Prometheus Adapter: prometheus-adapter is a Kubernetes API server extension. It connects to your Prometheus server, runs pre-configured PromQL queries, and exposes the results through the standard Kubernetes Custom Metrics API (custom.metrics.k8s.io). It effectively translates the language of Prometheus into the language of the HPA.
* The HPA Controller: reads the metric from the Custom Metrics API and computes desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)].
* The Scale Target: the HPA adjusts the .spec.replicas field on the target scalable resource, triggering the Kubernetes scheduler to create or terminate pods.
Understanding this chain of dependencies is key to debugging. If scaling isn't working, you must trace the metric from its source all the way to the HPA controller's logs.
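For context, the adapter plugs into the Kubernetes aggregation layer through an APIService registration. Installation methods such as the prometheus-adapter Helm chart typically create this for you; the sketch below assumes the adapter runs behind a Service named prometheus-adapter in the monitoring namespace, matching the ConfigMap used later in this article:
# custom-metrics-apiservice.yaml (sketch)
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  # Serve the custom metrics group/version from the adapter's Service.
  group: custom.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: prometheus-adapter   # assumed Service name
    namespace: monitoring
  insecureSkipTLSVerify: true  # prefer a proper caBundle in production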
Production Scenario: Scaling RabbitMQ Consumers Based on Queue Depth
Let's implement a robust solution for scaling a StatefulSet of consumers for a specific RabbitMQ queue. Our goal is to maintain an average of 1000 ready messages per consumer pod. If the queue depth rises to 3000, the HPA should scale up to 3 pods. If it drops to 500, it should scale down (respecting stabilization windows).
Step 1: Exposing the Metric
We assume you have a running RabbitMQ cluster and the official Prometheus RabbitMQ Exporter deployed. This exporter provides a crucial metric: rabbitmq_queue_messages_ready. A sample metric exposed might look like this:
rabbitmq_queue_messages_ready{cluster="rabbitmq-cluster-a", durable="true", exclusive="false", node="rabbit@rabbitmq-0", queue="work-queue-critical", self="", state="running", vhost="/"} 2567.0
The important labels for us are queue (the name of the queue) and the metric value itself.
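How the exporter gets scraped depends on your Prometheus setup. If you run the Prometheus Operator, a ServiceMonitor is the usual mechanism; the sketch below assumes the exporter's Service carries an app: rabbitmq-exporter label and names its metrics port metrics:
# rabbitmq-exporter-servicemonitor.yaml (sketch, assumes the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rabbitmq-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: rabbitmq-exporter   # assumed label on the exporter's Service
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics            # assumed port name exposing /metrics
      interval: 15s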
Step 2: Configuring the Prometheus Adapter
This is the most complex and critical piece of the puzzle. The prometheus-adapter is configured via a ConfigMap. We need to tell it how to discover available metrics and how to construct a specific query when the HPA asks for a metric value.
Here is a production-grade configuration. We will break it down piece by piece.
# prometheus-adapter-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter
namespace: monitoring
data:
config.yaml: |-
rules:
# This rule discovers all 'rabbitmq_queue_messages_ready' metrics and makes them available
# to the custom metrics API.
- seriesQuery: 'rabbitmq_queue_messages_ready{queue!="", vhost!=""}'
# We need to associate the Prometheus labels with Kubernetes resources.
# This is how the HPA can target a metric for a specific object.
resources:
# We will target a Kubernetes Service that represents our RabbitMQ queue.
# This is a robust pattern for object metrics.
template: <<.Resource>>
overrides:
# The 'vhost' label in Prometheus will map to the 'namespace' of the K8s object.
vhost:
resource: namespace
# The 'queue' label in Prometheus will map to the 'name' of the K8s object (our Service).
queue:
resource: service
# This defines how the metric name in Kubernetes is constructed from the Prometheus metric name.
      # Here it is a simple pass-through: the custom metric keeps the Prometheus name.
      name:
        matches: "^rabbitmq_queue_messages_ready$"
        as: "rabbitmq_queue_messages_ready"
# This is the most important part: the query the adapter runs when the HPA asks for the metric value.
metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (queue, vhost)'
# External metrics can also be defined if they don't relate to a K8s object.
# We are focusing on custom metrics here.
Dissecting the Configuration:
* seriesQuery: This is the discovery query. The adapter runs this against Prometheus to find out what metrics are available. rabbitmq_queue_messages_ready{queue!="", vhost!=""} finds all time-series for ready messages that have a non-empty queue and vhost label. This is a performance optimization; without it, the adapter might pull a massive amount of series data.
* resources: This section is the magic that connects the Prometheus world to the Kubernetes world. We're telling the adapter: "When you find a metric, the queue label corresponds to the name of a Kubernetes service, and the vhost label corresponds to its namespace." This allows us to define an HPA that targets a specific Kubernetes object (a Service representing our queue). Note that the vhost override assumes your RabbitMQ vhosts are named after Kubernetes namespaces (a vhost called production, for example); if your queues live on the default / vhost, drop the vhost override and match on the queue label alone.
* name: This rewrites the Prometheus metric name into a Kubernetes-friendly custom metric name. Here, we're just doing a simple pass-through.
* metricsQuery: This is the template for the query that will be executed when the HPA requests the value for a specific object. <<.Series>> is replaced by the metric name found by the seriesQuery. <<.LabelMatchers>> is replaced by the labels corresponding to the Kubernetes object targeted by the HPA (e.g., queue="work-queue-critical", vhost="production"). The sum(...) by (queue, vhost) ensures we get a single value per queue, even if the metric is reported by multiple RabbitMQ nodes.
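To make the templating concrete: if an HPA targets the Service work-queue-critical in the production namespace (and, per the note above, a matching production vhost exists), the adapter would render and execute roughly this query:
# Rendered metricsQuery for Service "work-queue-critical" in namespace "production"
sum(rabbitmq_queue_messages_ready{queue="work-queue-critical", vhost="production"}) by (queue, vhost)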
Step 3: Creating a Kubernetes Service as a Metric Target
Since our adapter configuration maps the queue label to a Kubernetes Service, we need to create one. This is a conceptual link; the Service doesn't need to have endpoints. It acts as a stable, addressable Kubernetes object that our HPA can point to.
# rabbitmq-queue-service.yaml
apiVersion: v1
kind: Service
metadata:
name: work-queue-critical
namespace: production
# This service is just a reference for the HPA
# No selector or ports are needed.
spec:
clusterIP: None
Step 4: Defining the HorizontalPodAutoscaler
Now we can finally define the HPA resource. We'll use an Object metric type, pointing it at the Service we just created.
# rabbitmq-consumer-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: rabbitmq-consumer-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet # Or Deployment
name: rabbitmq-consumer
minReplicas: 3
maxReplicas: 50
metrics:
- type: Object
object:
metric:
name: rabbitmq_queue_messages_ready
describedObject:
apiVersion: v1
kind: Service
name: work-queue-critical # This MUST match the service we created
      target:
        type: Value
        # Our desired state is ~1000 messages per pod, but 'Value' treats the target
        # as an absolute total for the whole queue. The HPA formula
        #   desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
        # with 10000 ready messages, 3 replicas, and a target of 1000 yields
        #   ceil(3 * 10000 / 1000) = 30 pods.
        # That is not per-pod scaling. What we want is an 'AverageValue' target.
        value: "1000"
# A TEMPTING ALTERNATIVE: THE 'Pods' METRIC TYPE
metrics:
- type: Pods
  pods:
    metric:
      # This per-pod metric doesn't exist yet; the adapter has to synthesize it (see below).
      name: rabbitmq_work_queue_critical_messages_per_pod
    target:
      type: AverageValue
      averageValue: "1000" # Target 1000 messages per pod.
Correction and Deeper Dive: The initial Object metric with a target.type of Value is a common pitfall. The Value target is absolute: the HPA tries to hold the total queue depth at the target number, regardless of how many pods are running. What we almost always want is a per-pod average.
The Pods metric type above is one way to get it, but it requires a more advanced prometheus-adapter configuration that synthesizes a per-pod value from the global queue metric. This is workable, but as we'll see, it comes at a cost.
Revised Prometheus Adapter Config for Pods Metric:
# prometheus-adapter-configmap-revised.yaml
data:
config.yaml: |-
rules:
    - seriesQuery: 'rabbitmq_queue_messages_ready{queue="work-queue-critical"}'
      # The queue metric carries no pod label, so the adapter cannot genuinely associate it
      # with individual pods; we synthesize a per-pod value inside the query instead.
      resources: {template: <<.Resource>>}
      # We are creating a new metric name for our HPA to use
      name:
        as: "rabbitmq_work_queue_critical_messages_per_pod"
      # The magic is in this query: it takes the total ready messages and divides by the number
      # of Running consumer pods (from kube-state-metrics). It requires careful construction.
      metricsQuery: 'sum(rabbitmq_queue_messages_ready{queue="work-queue-critical"}) / sum(kube_pod_status_phase{phase="Running", pod=~"rabbitmq-consumer-.*"})'
This revised approach is brittle: the metricsQuery hardcodes both the queue name and the pod naming convention (rabbitmq-consumer-.*), and it silently breaks the moment either changes. A superior pattern is to stick with the Object metric from Step 2 and let the HPA do the per-pod arithmetic itself. The autoscaling/v2 API supports a target.type of AverageValue on Object metrics: the HPA divides the object's metric value by the current replica count of the scale target, which is exactly the per-pod average we're after, with no per-pod gymnastics in the adapter. (If your application can export a genuinely per-pod metric, such as per-consumer backlog, the Pods type becomes the natural fit; most queue metrics, however, are global.)
If the metric were not tied to a Kubernetes object at all, an External metric type would be the natural fit; alternatively, a tool like KEDA is purpose-built for this kind of event-driven scaling and abstracts away most of this plumbing (a sketch of what that looks like follows).
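For comparison only, here is a minimal sketch of a KEDA ScaledObject for the same consumer StatefulSet. The trigger fields follow KEDA's RabbitMQ scaler; the RABBITMQ_HOST environment variable used to locate the broker is an assumption about your deployment, not something defined elsewhere in this article:
# keda-scaledobject.yaml (sketch, for comparison; not used in the rest of this article)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: rabbitmq-consumer
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: work-queue-critical
        mode: QueueLength           # scale on the ready-message count
        value: "1000"               # ~1000 messages per replica
        hostFromEnv: RABBITMQ_HOST  # assumed env var holding the broker URI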
However, to appreciate the power and complexity of the raw HPA/Prometheus pipeline, it's worth seeing what the fully per-pod route would cost in adapter configuration before settling on the Object metric with an AverageValue target.
For contrast, a fully templated per-pod metricsQuery would look something like this:
# config.yaml in ConfigMap
rules:
- seriesQuery: 'rabbitmq_queue_messages_ready{queue!=""}'
resources:
template: <<.Resource>>
name:
matches: "^rabbitmq_queue_messages_ready$"
as: "rabbitmq_queue_messages_ready_per_pod"
metricsQuery: 'sum(rabbitmq_queue_messages_ready{queue=<<.Label.queue>>}) by (queue) / on(queue) group_left() count(kube_pod_info{pod=~"^<<.Label.statefulset>>-.*"}) by (queue)'
This is extremely advanced, and deliberately so: the query is a placeholder for a join between the queue metric and pod metadata from kube-state-metrics (the <<.Label...>> placeholders are illustrative, not real adapter template variables), and configurations like this quickly become unwieldy and hard to debug. The pragmatic solution is the Object metric with an AverageValue target, backed by the Step 2 adapter configuration.
Let's proceed with that pattern, acknowledging its limitations:
# rabbitmq-consumer-hpa-final.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: rabbitmq-consumer-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: rabbitmq-consumer
minReplicas: 3
maxReplicas: 50
metrics:
- type: Object
object:
metric:
# This name must match the one exposed by the adapter
name: rabbitmq_queue_messages_ready
describedObject:
apiVersion: v1
kind: Service
name: work-queue-critical
target:
type: AverageValue
averageValue: "1k" # Target 1000 messages per pod
# Behavior section for stability
behavior:
scaleDown:
stabilizationWindowSeconds: 900 # 15 minutes
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 120
scaleUp:
stabilizationWindowSeconds: 120 # 2 minutes
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 10
periodSeconds: 15
The HPA v2 API Object metric with AverageValue works by dividing the metric value by the current number of replicas in the scaleTargetRef. This is exactly what we need.
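As a quick sanity check, plug the numbers from the earlier example into the HPA formula:
# Worked example for target.type: AverageValue with averageValue: 1000
# total ready messages (Object metric value) = 10000
# current replicas                           = 3
# per-pod average  = 10000 / 3 ≈ 3333
# desiredReplicas  = ceil[3 * (3333 / 1000)] = 10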
Advanced Considerations and Edge Case Handling
Getting the basic scaling to work is only half the battle. A production system must be resilient to transient spikes, metric delays, and component failures.
1. Taming the Beast: Avoiding Scaling Flap
The Problem: A burst of 50,000 messages hits the queue. The HPA scales up to 50 pods. The pods consume the queue in 30 seconds. The metric drops to zero. The HPA immediately begins scaling down. Another burst arrives, and the cycle repeats. This "flapping" is inefficient and puts unnecessary strain on the cluster and the application.
Solution: HPA Behavior Policies
The behavior block in the HPA v2 spec is your primary tool for dampening scaling reactions. In the YAML above:
* scaleDown.stabilizationWindowSeconds: 900: The HPA will look at the history of desired replica counts over the last 15 minutes and will only scale down to the highest recommendation in that window. This prevents it from scaling down immediately after a spike is handled.
* scaleDown.policies: We've defined a conservative scale-down policy: remove at most 10% of the current pods per 60-second period, or at most 2 pods per 120-second period. With the default selectPolicy of Max, the policy permitting the larger change applies. This ensures a slow, graceful reduction in capacity (and can be tightened further with selectPolicy; see the sketch after this list).
* scaleUp.stabilizationWindowSeconds: 120: We want to react to increased load more quickly. The scale-up logic considers the last 2 minutes.
* scaleUp.policies: The scale-up policy is aggressive. It allows doubling the pod count (100%) or adding 10 pods every 15 seconds, ensuring a rapid response to a sudden backlog.
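Both the scaleUp and scaleDown blocks also accept a selectPolicy field (Max is the default). If the combination above is still too aggressive for your workload, a sketch of the more cautious options:
# Inside spec.behavior.scaleDown
selectPolicy: Min        # apply whichever policy allows the SMALLEST change
# selectPolicy: Disabled # or: turn off automatic scale-down entirely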
2. The Ghost of Latency: Metric Delays
Remember our architecture diagram. There is inherent latency at each step:
* Exporter scrapes RabbitMQ: (e.g., every 15s)
* Prometheus scrapes exporter: (e.g., every 30s)
* HPA controller syncs: (e.g., every 15s)
In the worst case, it could be over a minute before the HPA reacts to a change in the queue. You must account for this. Don't set your scrape intervals too high. For critical autoscaling loops, consider dedicated, low-interval scrape jobs in Prometheus (e.g., 5-10 seconds).
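A dedicated low-interval scrape job might look like the sketch below (raw prometheus.yml syntax; the job name, target address, and port 9419 are assumptions to adapt to your exporter deployment):
# prometheus.yml (sketch): a dedicated low-interval job for the scaling-critical metric
scrape_configs:
  - job_name: rabbitmq-exporter
    scrape_interval: 10s   # tighter than the global default to reduce HPA reaction lag
    scrape_timeout: 8s
    static_configs:
      - targets: ["rabbitmq-exporter.production.svc:9419"]  # assumed exporter address/port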
Solution: Smoothing with PromQL
Instead of scaling on the instantaneous value, you can scale on a moving average. This makes the system less reactive to brief, noisy spikes. Modify the metricsQuery in the Prometheus adapter:
# In prometheus-adapter config.yaml
metricsQuery: 'sum(avg_over_time(rabbitmq_queue_messages_ready{<<.LabelMatchers>>}[2m])) by (queue, vhost)'
This query now calculates the average queue depth over the last 2 minutes. The HPA will react to sustained pressure rather than momentary blips. Combining this with HPA behavior policies gives you two layers of stabilization for incredibly robust and predictable scaling.
3. Scaling StatefulSets: The Ordering Problem
When scaling a StatefulSet, Kubernetes provides ordering guarantees. It will terminate pods in reverse ordinal order (pod-N, pod-N-1, ...) and create them in forward order (pod-0, pod-1, ...). The HPA respects this. When it decides to scale down from 5 to 3 replicas, it will simply set .spec.replicas = 3, and the StatefulSet controller will handle the ordered termination of rabbitmq-consumer-4 and then rabbitmq-consumer-3.
This is usually desirable for stateful applications that need graceful handoffs. However, you must ensure your application's shutdown logic is fast enough to not excessively delay the scale-down operation. A long preStop hook or a slow shutdown process can cause the pod to exceed its termination grace period, leading to a SIGKILL and potential data inconsistency.
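A common pattern is an explicit drain step in a preStop hook, with a grace period that comfortably exceeds it. In the sketch below, /bin/drain-consumer is a hypothetical hook into your application's shutdown logic, not a real binary:
# Excerpt from the rabbitmq-consumer StatefulSet pod template (sketch)
spec:
  terminationGracePeriodSeconds: 120   # must exceed your worst-case drain time
  containers:
    - name: consumer
      image: registry.example.com/rabbitmq-consumer:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Hypothetical drain command: stop fetching new messages, ack in-flight work.
            command: ["/bin/drain-consumer", "--timeout=90s"]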
4. When Metrics Go Dark: Failure Modes
What happens if the RabbitMQ exporter pod dies? Or if Prometheus can't reach it? The metric will become stale in Prometheus, and the adapter will fail to retrieve it.
When this happens, the HPA's ScalingActive condition flips to False with a reason such as FailedGetObjectMetric, and the controller simply stops scaling until the metric becomes available again. It will not scale down to minReplicas. This is a safe default, but it means your application can get stuck at a high replica count until the metric pipeline is restored.
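If you suspect this state, the HPA's status conditions are the first place to look; under the failure described above they would read roughly as follows (excerpt, fields per the autoscaling/v2 status schema):
# Status excerpt from the HPA (abridged)
status:
  conditions:
    - type: AbleToScale
      status: "True"
      reason: SucceededGetScale
    - type: ScalingActive
      status: "False"
      reason: FailedGetObjectMetric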
Solution: Alerting
This is a monitoring problem. You MUST have alerts in Prometheus Alertmanager for:
* up{job="rabbitmq-exporter"} == 0 (or absent(up{job="rabbitmq-exporter"}) if the target vanishes from service discovery entirely): the exporter is down.
* changes(rabbitmq_queue_messages_ready[15m]) == 0: The metric value hasn't changed in 15 minutes, indicating it might be stale.
These alerts notify an operator that the autoscaling loop is blind and requires manual intervention.
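Expressed as a PrometheusRule (assuming the Prometheus Operator; group names and severities are placeholders), those checks might look like this:
# autoscaling-pipeline-alerts.yaml (sketch, assumes the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-pipeline-alerts
  namespace: monitoring
spec:
  groups:
    - name: autoscaling-pipeline
      rules:
        - alert: RabbitmqExporterDown
          expr: up{job="rabbitmq-exporter"} == 0 or absent(up{job="rabbitmq-exporter"})
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "The RabbitMQ exporter is down; the HPA is scaling blind."
        - alert: RabbitmqQueueMetricStale
          expr: changes(rabbitmq_queue_messages_ready{queue="work-queue-critical"}[15m]) == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Queue depth metric has not changed in 15 minutes; it may be stale."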
Conclusion: From Reactive to Predictive Scaling
Implementing custom metric-based autoscaling with Prometheus and the HPA is a significant step up from basic resource-based scaling. It transforms your system from being purely reactive to workload-aware. The key takeaways for a production-grade implementation are:
* Use behavior policies aggressively to prevent flapping. It's better to over-provision slightly for a few minutes than to have an unstable, oscillating system.
* Use smoothing functions like avg_over_time in your adapter configuration to base scaling decisions on trends, not noisy, instantaneous values.
While tools like KEDA can simplify the setup for common sources, understanding the underlying mechanics of the HPA and the custom metrics API is a crucial skill for any senior engineer responsible for building scalable, resilient, and cost-effective systems on Kubernetes.