Argo Rollouts: Advanced Canary Analysis with Prometheus Metrics

Goh Ling Yong

Beyond `sleep`: Metric-Driven Progressive Delivery

For any team operating at scale, the standard Kubernetes RollingUpdate strategy presents an unacceptable risk. A faulty deployment can quickly propagate across all pods, leading to widespread outages. While canary deployments are a conceptual improvement, naive implementations often rely on a sleep duration—a period of hopeful observation—before manual promotion. This is insufficient. We are not fortune tellers; we are engineers. We need to replace hope with data.

This is where automated, metric-driven analysis becomes non-negotiable. The goal is to create a system where a deployment's promotion is contingent upon the canary version proving its stability against key Service Level Indicators (SLIs) like error rate and latency. Argo Rollouts provides the powerful orchestration layer, but the intelligence lies in crafting precise and resilient analysis rules against a monitoring backend like Prometheus.

This post is not an introduction to Argo Rollouts. It assumes you are familiar with the Rollout CRD and its basic canary strategy. We will focus exclusively on the advanced application of the AnalysisTemplate and AnalysisRun CRDs to build a production-ready, automated canary promotion gate using live Prometheus metrics.

We will construct a multi-stage analysis process that:

  • Validates a basic success rate against a static threshold.
  • Incorporates multiple, simultaneous metric checks (e.g., success rate and p95 latency).
  • Handles the critical edge case of a canary receiving no traffic.
  • Implements the most robust pattern: comparing the canary's performance directly against the stable version's baseline, making the analysis adaptive to current system conditions.

The Anatomy of a Metric-Driven Rollout

Before diving into complex templates, let's establish our components. A metric-driven canary with Argo Rollouts and Prometheus relies on a tight integration between these parts:

  • The Application: A service instrumented to expose relevant metrics in Prometheus format (e.g., request counts, latencies, error codes).
  • Prometheus: The time-series database scraping and storing these metrics. A ServiceMonitor CRD is typically used to configure this scraping.
  • Argo Rollouts Rollout CRD: Defines the deployment strategy, including the steps for shifting traffic (setWeight) and pausing for analysis.
  • Argo Rollouts AnalysisTemplate CRD: A reusable template defining what to measure and what constitutes success. This is where we'll embed our PromQL queries.
  • Argo Rollouts AnalysisRun CRD: An instantiation of an AnalysisTemplate for a specific rollout, which records the measurements and determines the outcome (Successful, Failed, Inconclusive).

Our focus will be on crafting sophisticated AnalysisTemplate resources that move from trivial checks to robust, baseline-aware validations.

    Setting the Stage: A Sample Service and its Metrics

    Let's assume we have a simple Go microservice, checkout-service, which exposes the following Prometheus metrics via its /metrics endpoint:

    * http_requests_total: A counter with labels method, path, and status_code.

    * http_request_duration_seconds: A histogram with label path.

    To make these metrics available to Prometheus, we define a Service and a ServiceMonitor:

    yaml
    # service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: checkout-service
      labels:
        app: checkout-service
    spec:
      ports:
      - port: 80
        targetPort: 8080
        protocol: TCP
        name: http
      selector:
        app: checkout-service
    ---
    # servicemonitor.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: checkout-service-monitor
      labels:
        release: prometheus # Assumes you use the prometheus-community helm chart
    spec:
      selector:
        matchLabels:
          app: checkout-service
      endpoints:
      - port: http

    With this in place, Prometheus will begin scraping our application pods. The scrape configuration generated by the Prometheus Operator also attaches a service label (set to the Kubernetes Service name) to each scraped series, which is the label the analysis queries below filter on.
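
    This wiring assumes the Rollout's pod template carries the app: checkout-service label (so the Service selector matches) and that the container listens on port 8080, the Service's targetPort. An illustrative excerpt of that pod template (the image name is a placeholder):

    yaml
    # Excerpt from the Rollout's spec.template
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: registry.example.com/checkout-service:1.4.2 # placeholder image tag
          ports:
            - name: http
              containerPort: 8080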

    Phase 1: Simple Success Rate Analysis

    Our first iteration will be a basic check: the canary's HTTP success rate must be above 99%. This involves creating an AnalysisTemplate with a single PromQL query.

    The `AnalysisTemplate`

    This template defines a reusable check for success rate. It's parameterized using args to allow for flexibility.

    yaml
    # analysis-template-success-rate.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: success-rate-check
    spec:
      args:
        - name: service-name
      metrics:
        - name: success-rate
          # Allow up to 3 failed measurements before the analysis as a whole is
          # considered Failed. This absorbs transient Prometheus scrape or query issues.
          failureLimit: 3
          # Take 5 measurements in total, one every 'interval'.
          count: 5
          interval: 20s
          provider:
            prometheus:
              address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{
                  service="{{args.service-name}}",
                  status_code=~"^2.."
                }[1m]))
                /
                sum(rate(http_requests_total{
                  service="{{args.service-name}}"
                }[1m]))
          # The success condition defines what a 'passing' result looks like.
          # 'result[0]' refers to the first (and only) result from the query.
          successCondition: result[0] > 0.99

    Dissecting the Query:

    * sum(rate(http_requests_total{...}[1m])): This is the standard way to calculate requests-per-second from a cumulative counter. We use rate over a 1-minute window ([1m]) to get a smoothed-out value.

    * status_code=~"^2..": This regex selects only successful (2xx) status codes.

    * The query calculates the ratio of the rate of successful requests to the rate of all requests.

    * {{args.service-name}}: This is how Argo Rollouts injects arguments into the template, making it reusable for different services.

    Integrating into the `Rollout`

    Now, we modify our Rollout to use this template. The analysis block is added to the canary steps.

    yaml
    # rollout.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: checkout-service
    spec:
      replicas: 5
      strategy:
        canary:
          # A dedicated canary Service (its selector is managed by Argo Rollouts)
          # lets us attribute metrics to canary pods only.
          canaryService: checkout-service-canary
          steps:
          - setWeight: 20
          # Short timed pause so Prometheus has scraped canary metrics before analysis starts.
          - pause: { duration: 1m }
          # --- Analysis Step --- #
          - analysis:
              templates:
                - templateName: success-rate-check
              args:
                - name: service-name
                  value: checkout-service-canary # We target the canary-specific Service
      # ... selector, template, etc.

    With this configuration, after shifting 20% of traffic to the canary and waiting out the short pause, Argo Rollouts creates an AnalysisRun from the success-rate-check template. It queries Prometheus every 20 seconds and takes 5 measurements. If the success rate stays above 99% across those measurements, the analysis succeeds and the rollout proceeds to the next step. If the number of failed measurements exceeds the failureLimit, the analysis fails and the rollout is aborted and automatically rolled back to the stable version.
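
    One prerequisite worth making explicit: the checkout-service-canary Service referenced above must exist as an ordinary Service object. Argo Rollouts does not create it, but it does manage its selector, adding the canary ReplicaSet's rollouts-pod-template-hash so the Service only ever points at canary pods. A minimal sketch, mirroring the stable Service:

    yaml
    # service-canary.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: checkout-service-canary
      labels:
        app: checkout-service # picked up by the same ServiceMonitor as the stable Service
    spec:
      ports:
      - port: 80
        targetPort: 8080
        protocol: TCP
        name: http
      selector:
        app: checkout-service # Argo Rollouts appends the canary pod-template-hash at runtime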

    Phase 2: Multi-Metric Analysis (Latency and Success Rate)

    Success rate alone is not a complete picture of health. A service could be responding successfully but be unacceptably slow. We need to analyze latency in parallel.

    Let's create a more advanced template that checks both P95 latency and success rate.

    yaml
    # analysis-template-latency-and-success.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: performance-check
    spec:
      args:
        - name: service-name
        - name: latency-threshold
          value: "0.5" # Default to 500ms
        - name: success-rate-threshold
          value: "0.99"
      metrics:
        - name: success-rate
          failureLimit: 2
          provider:
            prometheus:
              address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{
                  service="{{args.service-name}}",
                  status_code=~"^2.."
                }[2m]))
                /
                sum(rate(http_requests_total{
                  service="{{args.service-name}}"
                }[2m]))
          successCondition: result[0] >= {{args.success-rate-threshold}}
    
        - name: p95-latency
          failureLimit: 2
          provider:
            prometheus:
              address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
              # This query calculates p95 latency from a Prometheus histogram.
              query: |
                histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{
                  service="{{args.service-name}}"
                }[2m])) by (le))
          successCondition: result[0] < {{args.latency-threshold}}

    Key Improvements:

  • Two metrics entries: Argo Rollouts evaluates success-rate and p95-latency in parallel, each on its own schedule; the AnalysisRun succeeds only if every metric finishes successfully (i.e. stays within its failureLimit).
  • Latency Query: histogram_quantile(0.95, ...) is the standard PromQL function for calculating quantiles from a histogram metric (http_request_duration_seconds_bucket).
  • More args: We've parameterized the thresholds themselves, allowing different services to specify their own SLOs when using this template.
    Your Rollout would now reference this performance-check template instead of success-rate-check, optionally overriding the threshold args per service, as shown in the excerpt below. This provides a much stronger guarantee of canary health before promotion.
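
    A minimal sketch of that analysis step inside the Rollout's canary steps, with illustrative SLO values for this particular service (both thresholds are placeholders you would tune per service):

    yaml
    # Excerpt from the Rollout's canary steps
    - analysis:
        templates:
          - templateName: performance-check
        args:
          - name: service-name
            value: checkout-service-canary
          - name: latency-threshold
            value: "0.3"    # illustrative p95 SLO of 300ms
          - name: success-rate-threshold
            value: "0.995"  # illustrative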

    Edge Case: The "No Traffic" Problem

    A subtle but critical failure mode exists in our current setup. What if, due to a service mesh misconfiguration or low overall traffic, the canary service receives zero requests during the analysis window?

    * If the canary's series exist but record no new requests, the success rate query returns NaN (0/0), which fails the > 0.99 check. If no matching series exist at all, the query returns an empty result.

    * The latency query likewise returns an empty result set.

    Depending on which case you hit, the AnalysisRun ends up Failed or Inconclusive. An Inconclusive run (for example, once the inconclusiveLimit is reached) neither promotes nor rolls back automatically; it pauses the rollout and waits for a human decision. That is often acceptable, but we can be more explicit and provide a clearer failure reason.

    We can add a dedicated metric to ensure a minimum level of traffic is present.

    yaml
    # analysis-template-with-traffic-check.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: robust-performance-check
    spec:
      args:
        # ... same args as before
        - name: min-requests-per-second
          value: "0.1" # Require at least 1 request every 10 seconds
      metrics:
        - name: traffic-check
          failureLimit: 1
          provider:
            prometheus:
              address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
              query: | 
                sum(rate(http_requests_total{
                  service="{{args.service-name}}"
                }[2m]))
          # Fail fast if traffic is below the minimum threshold.
          successCondition: result[0] >= {{args.min-requests-per-second}}
    
        # ... include success-rate and p95-latency metrics from the previous example

    This traffic-check metric doesn't gate the other metrics (all metrics in an AnalysisRun are evaluated in parallel), but it turns a confusing NaN-driven or empty-result failure into an explicit one: if the canary isn't receiving meaningful load, the AnalysisRun clearly reports that traffic-check failed, which makes debugging much faster.

    Phase 3: The Gold Standard - Canary vs. Stable Baseline Analysis

    Static thresholds (< 500ms, > 99%) are brittle. System load fluctuates. A background process could increase baseline latency across the entire cluster. In such a scenario, a perfectly healthy canary might fail its analysis simply because its latency is 510ms while the stable version is also at 510ms. The canary isn't worse; the environment has changed.

    The most resilient production pattern is to compare the canary's performance directly against the stable version's performance at the same point in time. The success condition becomes relative: "The canary's error rate must not be more than 10% higher than the stable version's error rate."

    To achieve this, we need to distinguish metrics coming from canary pods vs. stable pods. Argo Rollouts makes this possible by adding a unique rollouts-pod-template-hash label to every pod it manages. Two practical details follow from that: the hash values must be handed to the analysis as arguments (you'll see this in the Rollout spec further down), and the pod label has to be copied onto the scraped series so PromQL can filter on it, as sketched below; Prometheus sanitizes the dashes, so the label appears as rollouts_pod_template_hash in queries.
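
    A minimal sketch of that propagation, assuming the Prometheus Operator: the ServiceMonitor's podTargetLabels field copies the pod label onto every scraped series, and Prometheus replaces the dashes with underscores:

    yaml
    # servicemonitor.yaml (extended)
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: checkout-service-monitor
      labels:
        release: prometheus
    spec:
      selector:
        matchLabels:
          app: checkout-service
      # Copy the Argo Rollouts hash label onto scraped metrics; it becomes
      # rollouts_pod_template_hash in PromQL.
      podTargetLabels:
        - rollouts-pod-template-hash
      endpoints:
      - port: http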

    The `AnalysisTemplate` for Baseline Comparison

    This template is significantly more complex but represents a true production-grade analysis strategy.

    yaml
    # analysis-template-baseline.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: baseline-error-rate-check
    spec:
      args:
        - name: service-name
        - name: prometheus-app-label # The label used to select your app pods
          value: "app"
        # These are supplied by the Rollout's analysis step via
        # valueFrom.podTemplateHashValue (see the Rollout spec below).
        - name: stable-pod-template-hash
        - name: canary-pod-template-hash
    
      metrics:
        - name: error-rate-comparison
          failureLimit: 3
          interval: 30s
          count: 10
          provider:
            prometheus:
              address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
              query: |
                # Compare the canary's error rate against the stable version's error rate.
                # PromQL has no variables, so both rates are written out inline.
                # '> bool' yields 1 if the canary error rate exceeds the stable error rate
                # by more than a 2% absolute margin, and 0 otherwise.
                # 'or on() vector(0)' returns 0 when the comparison has no samples
                # (e.g. no scraped series yet), so the result is never empty.
                (
                  (
                    sum(rate(http_requests_total{
                      {{args.prometheus-app-label}}="{{args.service-name}}",
                      rollouts_pod_template_hash="{{args.canary-pod-template-hash}}",
                      status_code!~"^2.."
                    }[2m]))
                    /
                    sum(rate(http_requests_total{
                      {{args.prometheus-app-label}}="{{args.service-name}}",
                      rollouts_pod_template_hash="{{args.canary-pod-template-hash}}"
                    }[2m]))
                  )
                  > bool
                  (
                    (
                      sum(rate(http_requests_total{
                        {{args.prometheus-app-label}}="{{args.service-name}}",
                        rollouts_pod_template_hash="{{args.stable-pod-template-hash}}",
                        status_code!~"^2.."
                      }[2m]))
                      /
                      sum(rate(http_requests_total{
                        {{args.prometheus-app-label}}="{{args.service-name}}",
                        rollouts_pod_template_hash="{{args.stable-pod-template-hash}}"
                      }[2m]))
                    )
                    + 0.02
                  )
                )
                or on() vector(0)

          # We succeed if the query result is 0.
          successCondition: result[0] == 0

    Deconstruction of this Advanced Template:

  • Hash args: The Rollout's analysis step supplies {{args.stable-pod-template-hash}} and {{args.canary-pod-template-hash}} using valueFrom.podTemplateHashValue, so every AnalysisRun knows exactly which ReplicaSet hashes to compare. This is what allows us to differentiate canary metrics from stable metrics.
  • Inline error rates: PromQL has no variables, so the canary and stable error rates are written as two inline ratio sub-expressions, each scoped to pods carrying the corresponding rollouts_pod_template_hash label.
  • The comparison logic: The expression asks whether the canary error rate exceeds the stable error rate by more than a 2% absolute margin; the bool modifier turns that comparison into a clean 1 (worse) or 0 (acceptable).
  • or on() vector(0): This is a crucial piece of PromQL for resiliency. If the comparison produces no samples at all (e.g. one side has no scraped series yet), the or falls back to vector(0), so the analysis doesn't break simply because data is momentarily missing.
  • Simplified successCondition: The query is engineered to return 0 for success and 1 for failure, which keeps successCondition: result[0] == 0 clean and unambiguous.

    Integrating the Baseline Analysis into the `Rollout`

    The Rollout spec is updated to use this new template. Note how the pod-template-hash arguments are supplied with valueFrom.podTemplateHashValue, which Argo Rollouts resolves to the stable and latest ReplicaSet hashes at the moment the AnalysisRun is created.

    yaml
    # rollout-baseline.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: checkout-service
    spec:
      replicas: 10
      strategy:
        canary:
          steps:
          - setWeight: 10
          - pause: { duration: 30s } # Initial pause to allow metrics to propagate
          - analysis:
              templates:
                - templateName: baseline-error-rate-check
              args:
                - name: service-name
                  value: checkout-service
                - name: prometheus-app-label
                  value: app
                - name: stable-pod-template-hash
                  valueFrom:
                    podTemplateHashValue: Stable
                - name: canary-pod-template-hash
                  valueFrom:
                    podTemplateHashValue: Latest
          - setWeight: 50
          - pause: { duration: 5m }
          # You can run the same analysis again at a higher traffic weight
          - analysis:
              templates:
                - templateName: baseline-error-rate-check
              # ... args ...
      # ... selector, template, etc.

    This setup represents a mature progressive delivery pipeline. Deployments are now validated against the live, real-time performance of the existing version, making them resilient to external environmental factors and providing a very high degree of safety.

    Conclusion: From Blind Faith to Data-Driven Confidence

    We have journeyed from simple, static thresholding to a dynamic, baseline-aware analysis that embodies the core principles of SRE and modern DevOps. By leveraging Argo Rollouts' AnalysisTemplate CRD with carefully crafted PromQL, we can build an automated promotion gate that uses system health as its primary signal.

    Key takeaways for production systems:

    * Never trust `sleep`: Always use metric-driven analysis for canary promotions.

    * Analyze multiple SLIs: A single metric like success rate is insufficient. Combine it with latency, saturation, or other business-critical indicators.

    * Plan for edge cases: Explicitly check for minimum traffic to avoid inconclusive results and ensure your canaries are actually being tested.

    * Prefer baseline comparison: Comparing the canary against the stable version is the most robust strategy, as it makes your deployment pipeline adaptive to real-time system conditions.

    Implementing these patterns moves your CI/CD process from a simple deployment mechanism to an intelligent, self-healing system that actively protects your users and your SLOs. It is a foundational element for any organization aiming to ship features quickly and safely at scale.
