Argo Rollouts: Metric-Driven Canary Analysis with AnalysisTemplates


Beyond `setWeight` and `pause`: The Imperative for Metric-Driven Deployments

For any team operating at scale on Kubernetes, the standard RollingUpdate strategy is often insufficient. It protects against pod-level failures (e.g., crash-looping containers) but offers no defense against application-level regressions that impact business metrics or user experience. Canary deployments, a foundational pattern in progressive delivery, address this by incrementally shifting traffic to a new version while observing its behavior.

However, a naive implementation of canary deployments often falls short. A typical first step with a tool like Argo Rollouts involves using setWeight to route a small percentage of traffic, followed by a pause for a fixed duration. This pattern relies on a human operator to manually verify dashboards and metrics during the pause, a process that is error-prone, non-scalable, and antithetical to the goals of a fully automated CI/CD pipeline.

The true power of progressive delivery is unlocked when the analysis and promotion/rollback decisions are automated based on real-time performance metrics. This is precisely the problem that Argo Rollouts' AnalysisTemplate and AnalysisRun Custom Resource Definitions (CRDs) are designed to solve. This article will bypass introductory concepts and dive directly into the advanced implementation of metric-driven analysis, targeting engineers who are already familiar with Argo Rollouts basics but want to implement production-grade, automated deployment safety gates.

We will construct a complete, end-to-end example that uses Prometheus as a metrics provider to automatically analyze an application's key Service Level Objectives (SLOs)—specifically, request success rate and P99 latency—during a canary rollout. We will focus on building reusable AnalysisTemplates, handling edge cases, and understanding the mechanics of automated rollbacks when an SLO is breached.


The Core Components: `AnalysisTemplate` and `Rollout` Integration

Before we build our solution, let's dissect the key components we'll be orchestrating. This is not an introduction, but a refresher on the spec fields relevant to our advanced use case.

AnalysisTemplate: This is a reusable template for an analysis. It comes in a namespace-scoped form (AnalysisTemplate) and a cluster-scoped form (ClusterAnalysisTemplate). Its power lies in its reusability: a single well-crafted template can be used by dozens of Rollout resources.

Key spec fields (a minimal skeleton showing how they nest follows this list):

* args: An array of arguments that can be passed into the template from a Rollout, allowing for dynamic values like service names or specific metric thresholds.

* metrics: The core of the template. This array defines the checks to be performed.

* name: A unique name for the metric check.

* interval: How often to run the metric query.

* count: The number of measurements to take. With the default failureLimit of 0, every measurement must satisfy successCondition for the metric, and therefore the analysis, to pass.

* successCondition / failureCondition: The expressions evaluated against the query result. A metric is considered successful if successCondition evaluates to true. It fails if failureCondition evaluates to true. If neither is met, it becomes Inconclusive.

* provider: Specifies the backend to query. We will focus on prometheus.
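
Putting those fields together, a metric nests inside a template roughly like this minimal skeleton (names, thresholds, and the address are placeholders; the full production-oriented template appears later in this article):

yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: example-template
spec:
  args:
    - name: service-name
  metrics:
    - name: example-metric
      interval: 30s
      count: 5
      successCondition: 'result[0] >= 0.99'
      failureCondition: 'result[0] < 0.99'
      provider:
        prometheus:
          address: http://prometheus.example.svc:9090
          query: |
            # Placeholder query; see the full template later for real SLO queries
            vector(1)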

Rollout: The Rollout resource orchestrates the deployment. The critical section for our purpose is within strategy.canary.steps.

Instead of a simple pause, we use an analysis step:

yaml
- analysis:
    templates:
      - templateName: slo-analysis # Reference to our AnalysisTemplate
    args:
      - name: service-name
        value: my-app
      - name: prometheus-label-selector
        value: 'app="my-app"'

This configuration instructs the Argo Rollouts controller to initiate an AnalysisRun based on the slo-analysis template during this step. The rollout is paused until the AnalysisRun completes with a Successful or Failed status. A Failed status will trigger an immediate and automatic rollback of the deployment.
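
If you prefer one shared, cluster-wide template (a ClusterAnalysisTemplate) instead of an AnalysisTemplate per namespace, the step references it the same way, with clusterScope set on the template entry. A minimal sketch, assuming slo-analysis exists as a ClusterAnalysisTemplate:

yaml
- analysis:
    templates:
      - templateName: slo-analysis
        # Required when referencing a ClusterAnalysisTemplate instead of a
        # namespaced AnalysisTemplate
        clusterScope: true
    args:
      - name: service-name
        value: my-app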


Scenario Setup: A Sample Microservice and its Metrics

To ground our discussion, let's define a target application. It's a simple Go service that exposes an HTTP endpoint and is instrumented with Prometheus metrics. The crucial part is that the Deployment and Service objects are correctly labeled to distinguish between canary and stable pods, which is essential for our Prometheus queries.

1. The Deployment and Service Manifests:

We'll manage the pods via a Rollout resource, but let's first consider the underlying service discovery. The key pattern is to have a single Kubernetes Service that selects pods from both the stable and canary ReplicaSets. The Argo Rollouts controller manages the pod labels and counts to achieve the desired traffic split.

yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app: my-app # This selector will match pods from both stable and canary ReplicaSets

The Argo Rollouts controller will ensure that pods from the canary ReplicaSet have a unique label/hash that can be targeted for metrics, typically rollouts-pod-template-hash.
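
For illustration, once a dedicated canaryService is configured (as in the Rollout later in this article), the controller patches that Service's selector with the canary ReplicaSet's hash at runtime; the hash value below is purely illustrative:

yaml
# Selector of the my-app-canary Service after the controller patches it
spec:
  selector:
    app: my-app
    rollouts-pod-template-hash: 7d5f6c9b8d # injected by the Argo Rollouts controller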

2. Exposing SLO Metrics to Prometheus:

Our application needs to expose relevant metrics. We will focus on two primary SLOs:

* Success Rate: The percentage of non-5xx HTTP responses.

* P99 Latency: The 99th percentile of request duration.

A typical Prometheus metric exposed by the service might look like this:

text
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1027
http_requests_total{code="500",method="get"} 2

# HELP http_request_duration_seconds The HTTP request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 129
http_request_duration_seconds_bucket{le="0.01"} 234
# ... and so on

With this setup, we can now write Prometheus queries to isolate the performance of the canary deployment.
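
For example, setting canary isolation aside for a moment, the two SLOs can be computed from those series with queries along these lines (label matchers are illustrative):

text
# Success rate: share of non-5xx requests over the last minute
sum(rate(http_requests_total{code!~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))

# P99 latency over the last two minutes, in seconds
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[2m])) by (le))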


Deep Dive: Crafting a Reusable, Multi-Metric `AnalysisTemplate`

Let's build our AnalysisTemplate. The goal is to create a generic template that can be reused across multiple services by passing in arguments.

yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-analysis
  namespace: production # Must live in the same namespace as the Rollouts that reference it (or use a ClusterAnalysisTemplate)
spec:
  args:
    # The name of the service being deployed
    - name: service-name
    # The label selector to isolate the service's metrics
    - name: prometheus-label-selector

  metrics:
    - name: success-rate
      # Run the query every 20 seconds
      interval: 20s
      # Run a total of 6 times (total analysis time: 2 minutes)
      count: 6
      # Tolerate up to 2 consecutive query errors (e.g., Prometheus timeouts)
      # before this metric is considered errored
      consecutiveErrorLimit: 2
      # Mark as failed if the success rate is at or below 99.5%
      failureCondition: 'result[0] <= 0.995'
      # Mark as successful if the success rate is above 99.5%
      successCondition: 'result[0] > 0.995'
      provider:
        prometheus:
          address: http://prometheus.kube-prometheus-stack.svc.cluster.local:9090
          # This is a complex PromQL query. Let's break it down.
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}", {{args.prometheus-label-selector}}, code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}", {{args.prometheus-label-selector}}}[1m]))

    - name: p99-latency
      interval: 30s
      count: 4 # Total analysis time: 2 minutes
      # Fail if P99 latency exceeds 500ms
      failureCondition: 'result[0] > 0.5'
      # Succeed if P99 latency is at or below 500ms
      successCondition: 'result[0] <= 0.5'
      provider:
        prometheus:
          address: http://prometheus.kube-prometheus-stack.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{args.service-name}}", {{args.prometheus-label-selector}}}[2m])) by (le))

Deconstructing the `AnalysisTemplate`:

  • Reusability with args: We define service-name and prometheus-label-selector as arguments. This allows a Rollout for service-a and a Rollout for service-b to use the same template by passing different values. The {{args.service-name}} syntax performs the substitution inside the queries.
  • Success Rate Query: This PromQL query is critical.
    * sum(rate(http_requests_total{...}[1m])) calculates the per-second request rate averaged over the last minute.
    * code!~"5.." in the numerator filters for all non-5xx status codes.
    * The division gives us the ratio of successful requests to total requests.
    * {{args.prometheus-label-selector}} is where canary isolation happens. The selector passed here must match only the canary pods (for example, one keyed on the canary ReplicaSet's rollouts-pod-template-hash label); otherwise the analysis blends stable and canary traffic. Argo Rollouts does not rewrite your query for you, but it can resolve the canary hash at runtime through a valueFrom argument, as sketched after the Rollout strategy discussion below.
  • P99 Latency Query: This query uses histogram_quantile to calculate the 99th percentile latency from our http_request_duration_seconds histogram, which is far more robust than averaging latency.
  • Conditions (successCondition, failureCondition): We are explicit. result is the array of values returned by the query; for these queries we expect a single scalar, so result[0] <= 0.995 means the check fails whenever the measured success rate is at or below our SLO threshold.
  • Timing (interval, count): The success-rate metric is checked every 20 seconds for 2 minutes, and with the default failureLimit of 0 all 6 measurements must pass for the metric to succeed (a more forgiving variant follows this list). The p99-latency metric is checked less frequently because latency is more volatile and benefits from a larger query window ([2m] in the query).
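
If requiring every single measurement to pass is too strict for a noisy metric, the failureLimit field tolerates a bounded number of failing measurements. A minimal sketch of a more forgiving success-rate metric (thresholds unchanged, limit illustrative):

yaml
    - name: success-rate
      interval: 20s
      count: 6
      # Allow at most one failing measurement out of six before the metric is marked Failed
      failureLimit: 1
      failureCondition: 'result[0] <= 0.995'
      successCondition: 'result[0] > 0.995'
      provider:
        prometheus:
          address: http://prometheus.kube-prometheus-stack.svc.cluster.local:9090
          # Same success-rate query as in the template above
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}", {{args.prometheus-label-selector}}, code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}", {{args.prometheus-label-selector}}}[1m]))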

Tying it Together: The Advanced Canary `Rollout`

Now, let's create a Rollout resource that uses our AnalysisTemplate in a multi-step canary strategy.

yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-repo/my-app:1.1.0 # The version being rolled out
          ports:
            - containerPort: 8080

  strategy:
    canary:
      # This service is used to route traffic to the canary pods.
      # It will be managed by the Rollouts controller.
      canaryService: my-app-canary
      # This is the stable service we defined earlier.
      stableService: my-app
      steps:
        # Step 1: Send 10% of traffic to the canary
        - setWeight: 10

        # Step 2: Run analysis for 2 minutes at 10% weight
        - analysis:
            templates:
              - templateName: slo-analysis
            # Note: app="my-app" matches stable and canary pods alike; to strictly
            # isolate canary metrics, pass the canary pod-template hash instead
            # (see the sketch after the strategy analysis below).
            args:
              - name: service-name
                value: my-app
              - name: prometheus-label-selector
                value: 'app="my-app"'

        # Step 3: If analysis succeeds, ramp up to 50%
        - setWeight: 50

        # Step 4: Pause for manual verification or a longer soak time if desired
        - pause: { duration: 10m }

        # Step 5: Final, shorter analysis before full promotion
        - analysis:
            templates:
              - templateName: slo-analysis
            args:
              - name: service-name
                value: my-app
              - name: prometheus-label-selector
                value: 'app="my-app"'

Analysis of the `Rollout` Strategy:

* canaryService / stableService: This is a more precise traffic routing method than relying on a single shared Service. The controller manages two Services, my-app (pointing to stable pods) and my-app-canary (pointing to canary pods). When paired with a trafficRouting configuration (an Ingress controller or service mesh), traffic is split between them at the requested weights; without one, the weight is approximated by scaling the canary ReplicaSet. Either way, the separation provides better isolation.

* Multi-Step Analysis: We don't just run one analysis. We run an initial analysis at a low traffic weight (10%). This is a crucial safety check. If it passes, we are more confident ramping up to 50%, and we run a final analysis before going to 100%. This progressive validation is a hallmark of a mature CI/CD process.

* Argument Passing: Note how the Rollout's analysis steps reference slo-analysis and pass the required args. Be aware that the controller does not rewrite your Prometheus query to target the canary pods; it only substitutes the arguments you supply. To strictly isolate canary metrics, pass the canary ReplicaSet's pod-template hash as an argument, as in the sketch below.
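
A minimal sketch of that approach, assuming your Prometheus scrape configuration copies the rollouts-pod-template-hash pod label onto scraped series (here assumed to surface as the metric label rollouts_pod_template_hash), and assuming slo-analysis is adjusted to declare a canary-hash argument that its queries reference:

yaml
# In the Rollout's analysis step: resolve the canary ReplicaSet's hash at runtime
- analysis:
    templates:
      - templateName: slo-analysis
    args:
      - name: service-name
        value: my-app
      # Built-in value: the pod-template hash of the latest (canary) ReplicaSet
      - name: canary-hash
        valueFrom:
          podTemplateHashValue: Latest

# In the AnalysisTemplate, the query selector would then look like:
#   {job="{{args.service-name}}", rollouts_pod_template_hash="{{args.canary-hash}}", code!~"5.."}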


Production Patterns and Edge Case Handling

Implementing the above is a great start, but production environments are messy. Here's how to handle common edge cases and advanced patterns.

1. Edge Case: Metric Flapping and Inconclusive Results

What happens if a metric query returns a result that is neither a success nor a failure? Or if Prometheus is temporarily unavailable?

* Inconclusive State: If a measurement satisfies neither successCondition nor failureCondition, it is marked Inconclusive. An AnalysisRun that finishes Inconclusive does not automatically roll back; instead, the rollout pauses and waits for a human to promote or abort it. The inconclusiveLimit field controls how many inconclusive measurements a metric tolerates (the default is 0), but leaning on it is generally not recommended, as it can mask underlying problems.

* consecutiveErrorLimit: We used this in our template. If the Prometheus query itself errors (e.g., timeout, network error) more times in a row than consecutiveErrorLimit allows, the metric is marked as errored and the analysis fails; below that limit, transient hiccups in the metrics backend are tolerated and do not fail an otherwise healthy rollout.

* Designing Robust Queries: Your PromQL queries should be designed to always return a value. For example, appending or vector(0) (as in avg(my_metric) or vector(0)) ensures that if my_metric has no data, the query returns 0 instead of an empty result, avoiding inconclusive or errored measurements. A defensive version of our success-rate query is sketched below.
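
As a sketch of that defensive pattern applied to our success-rate query (shown with substituted values; the fallback is a policy choice, treating a window with no traffic as passing here):

text
(
  # No successful requests in the window -> 0 rather than an empty result
  (sum(rate(http_requests_total{job="my-app", code!~"5.."}[1m])) or vector(0))
  /
  sum(rate(http_requests_total{job="my-app"}[1m]))
)
# No traffic at all in the window -> treat as passing (1); use vector(0) to fail instead
or vector(1)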

2. Pattern: Pre- and Post-Promotion Analysis

Sometimes, you need to run checks before fully promoting a version or after it has received 100% of the traffic. Argo Rollouts exposes this as prePromotionAnalysis and postPromotionAnalysis, which are fields of the blueGreen strategy; the canary strategy achieves a similar effect with a final analysis step before full weight (as in our Rollout above) or with a background analysis under strategy.canary.analysis.

* prePromotionAnalysis: This runs an AnalysisRun against the new version (reachable via the preview Service) before the active Service is switched over to it. It acts as a final gatekeeper: if the analysis fails, the switch never happens.

* postPromotionAnalysis: This runs an AnalysisRun after traffic has been switched to the new version. It is useful as a final health check under full production load; if it fails, the rollout is aborted and traffic is switched back to the previous stable version.

yaml
# ... inside strategy.blueGreen ...
      # Runs before the active Service is switched to the new version
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test-analysis
      # Runs after the new version has taken 100% of traffic
      postPromotionAnalysis:
        templates:
          - templateName: soak-test-analysis

3. Pattern: Dynamic Thresholds with `web` Metrics

What if your SLO thresholds aren't static? For example, you might want to fail the canary if its error rate is 2x higher than the stable version's error rate, rather than using a fixed value.

This requires a more advanced pattern where an external service calculates the dynamic threshold. You can use the web metric provider in Argo Rollouts:

* Create an internal API endpoint (e.g., /check-baseline) that takes the canary's metrics as input, queries Prometheus for the stable version's metrics, compares them, and returns a JSON response indicating success or failure.

* Configure your AnalysisTemplate to use the web provider, evaluating the extracted field in the conditions:

yaml
    - name: dynamic-error-rate-check
      provider:
        web:
          url: http://my-analysis-service.internal/check-baseline
          jsonPath: '{.status}' # Extract the success/failure field from the JSON response
      # Assumes the endpoint returns e.g. {"status": "ok"} when the canary is healthy
      successCondition: result == "ok"
      failureCondition: result == "fail"

This is a powerful technique for implementing comparative analysis, which is often more effective than static thresholding.

The Automated Rollback in Action

Let's visualize what happens when a canary deployment fails its analysis.

1. A developer merges a change, a new image my-app:1.1.1 is built, and the Rollout manifest in Git is updated.
2. ArgoCD syncs the change, and the Argo Rollouts controller begins the rollout.
3. The first step, setWeight: 10, completes. A new ReplicaSet for 1.1.1 is created, and 10% of traffic is routed to it.
4. The analysis step begins, and an AnalysisRun resource is created.
5. The controller starts querying Prometheus every 20 seconds for the success-rate metric.
6. The new code has a bug that causes a 2% error rate, so the first query returns 0.98.
7. The failureCondition 'result[0] <= 0.995' evaluates to true.
8. The AnalysisRun is immediately marked as Failed.
9. The Argo Rollouts controller detects the failed AnalysisRun and aborts the rollout: it scales down the canary ReplicaSet (1.1.1) to zero and routes 100% of traffic back to the stable ReplicaSet (1.1.0).
10. The Rollout resource's status becomes Degraded.

This entire process happens automatically within seconds of the metric violation, without any human intervention, dramatically reducing the Mean Time to Recovery (MTTR) and the blast radius of a bad deployment. The whole sequence is also easy to observe from the command line, as sketched below.
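
A minimal set of commands for doing so, assuming the kubectl-argo-rollouts plugin is installed and using the names from our example:

text
# Watch the rollout, its steps, and the attached AnalysisRun live
kubectl argo rollouts get rollout my-app -n production --watch

# Overall health (reports Degraded after the automatic abort)
kubectl argo rollouts status my-app -n production

# Inspect the failed analysis and its individual measurements
kubectl get analysisruns -n production
kubectl describe analysisrun <analysisrun-name> -n production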

Conclusion: From Automation to Autonomy

By leveraging Argo Rollouts with metric-driven AnalysisTemplates, we elevate our deployment strategy from simple automation to a form of autonomy. We are no longer just scripting deployment steps; we are building a system that can make intelligent, data-driven decisions about production readiness.

This approach codifies SLOs directly into the deployment pipeline, creating a powerful safety net that allows engineering teams to ship features faster and with higher confidence. The key takeaways for senior engineers are:

* Stop Relying on pause: Manual verification is a bottleneck and an anti-pattern. Automate analysis with metrics.

* Embrace Reusability: Build generic AnalysisTemplates with args to enforce consistent quality gates across all your microservices.

* Isolate Canary Metrics: Ensure your Prometheus queries can precisely target the new version, separate from the stable version.

* Plan for Edge Cases: Design your analysis to be resilient to transient issues with your metrics provider and to handle inconclusive states gracefully.

* Think in Steps: Use multi-step analysis to progressively build confidence in a new release before exposing it to all users.

By implementing these advanced patterns, you can transform your CI/CD pipeline from a simple conveyor belt for code into a sophisticated, self-healing release management system.
