Canary Analysis with Argo Rollouts and Prometheus Metrics
Beyond Rolling Updates: Data-Driven Deployments
Standard Kubernetes RollingUpdate strategies are a significant improvement over manual deployments, but they operate on a simplistic health check model: is the pod running and responding to a liveness probe? This binary check is insufficient for complex, high-traffic services where subtle performance degradations—a 50ms increase in p99 latency or a 0.5% dip in success rate—can have a substantial business impact. For senior engineers responsible for service reliability, deploying with hope is not a strategy.
This is where progressive delivery, implemented with a tool like Argo Rollouts, becomes a critical practice. Argo Rollouts extends Kubernetes by providing advanced deployment strategies like Blue/Green and Canary. However, its true power lies in its ability to automate the analysis phase of these deployments. Instead of a human manually checking dashboards for 15 minutes, Argo Rollouts can query a metrics provider like Prometheus, analyze the results against predefined Service Level Objectives (SLOs), and automatically promote or roll back the release.
This article dives deep into the mechanics of implementing robust, metric-driven canary analysis using Argo Rollouts and Prometheus. We will bypass introductory concepts and focus on production-grade patterns, complex query construction, and handling the inevitable edge cases that arise in real-world systems.
The Core Primitives: `AnalysisTemplate` and `AnalysisRun`
To understand metric-driven analysis in Argo Rollouts, you must first understand its core CRDs: AnalysisTemplate and AnalysisRun. A Rollout object does not contain the analysis logic itself; it references it.
* AnalysisTemplate: A reusable, namespace-scoped template (its cluster-wide counterpart is ClusterAnalysisTemplate) that defines *how* to perform an analysis. It specifies the metrics provider (e.g., Prometheus), the queries to run, and the success/failure conditions.
* AnalysisRun: An instantiation of an AnalysisTemplate for a specific Rollout at a specific point in time. It records the measurements taken, the results of each measurement, and the final outcome (Successful, Failed, Inconclusive). You inspect AnalysisRun objects to debug a failed canary.
The decoupling is intentional. You can define a standard AnalysisTemplate for HTTP success rate and reuse it across dozens of microservices, promoting standardization of SLOs.
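If you want the same template shared across namespaces, Argo Rollouts also offers a cluster-scoped variant. A minimal sketch (the template name here is illustrative):
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: org-wide-success-rate # cluster-scoped, so no namespace
spec:
  # ...same args/metrics structure as a namespaced AnalysisTemplate...
A Rollout references it exactly like a namespaced template, with clusterScope: true added to the entry under analysis.templates.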
Let's examine a foundational AnalysisTemplate for a Prometheus provider. This template will form the basis of our scenarios.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate-check
namespace: production
spec:
args:
# Arguments to be passed in from the Rollout
- name: service-name
- name: namespace
metrics:
- name: success-rate
    # Condition each measurement must satisfy to be counted as successful
    successCondition: result[0] >= 0.99
    # Maximum number of failed measurements tolerated before the analysis fails
    failureLimit: 2
# How many measurements to take
count: 5
# How long to wait between measurements
interval: 30s
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
query: |
sum(rate(http_requests_total{job="{{args.service-name}}", namespace="{{args.namespace}}", code=~"2.."}[2m]))
/
sum(rate(http_requests_total{job="{{args.service-name}}", namespace="{{args.namespace}}"}[2m]))
Key fields dissected:
* spec.args: These are variables that the Rollout will supply. This makes the template reusable. Here, we parameterize the service-name and namespace to use in our PromQL query.
* successCondition: This is the heart of the analysis. It is an expression (evaluated with the expr library) run against the result of the query. Here, we expect the query to return a single value (result[0]), which must be greater than or equal to 0.99.
* failureLimit: The analysis run fails once the number of failed measurements exceeds this limit; the count is cumulative, not consecutive. A failed measurement is one where the successCondition evaluates to false.
* count & interval: Defines the analysis duration. In this case, the query runs 5 times with 30 seconds between measurements, so the analysis step lasts roughly two minutes (the first measurement is taken immediately).
* provider.prometheus.query: The PromQL query. Note the use of {{args.service-name}} to substitute the arguments passed from the Rollout. This query calculates the ratio of 2xx responses to total responses over a 2-minute window.
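When an analysis misbehaves, the AnalysisRun created from this template is the first thing to inspect (kubectl get analysisrun -n production -o yaml). The exact fields vary by Argo Rollouts version, but a trimmed, illustrative status for a failed run might look like this:
status:
  phase: Failed
  metricResults:
  - name: success-rate
    phase: Failed
    successful: 2
    failed: 3 # exceeded failureLimit: 2
    measurements:
    - phase: Successful
      value: "[0.9942]"
    - phase: Failed
      value: "[0.9581]"
    # ...remaining measurements elided...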
Scenario 1: A Production-Grade HTTP Success Rate Canary
Let's implement a full canary deployment for a hypothetical checkout-service. We'll deploy a new version, shift 10% of traffic to it, and then run an automated analysis against our 99% success rate SLO before proceeding.
The Application and its Metrics
First, we need an application that exposes Prometheus metrics. Below is a simple Go service that exports an http_requests_total counter via the official Prometheus client library, which is standard practice.
// main.go
package main
import (
"fmt"
"log"
"math/rand"
"net/http"
"strconv"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
version = "v1.0.0" // This will be injected at build time
httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests.",
}, []string{"code", "method"})
)
func handler(w http.ResponseWriter, r *http.Request) {
// Simulate some work
time.Sleep(time.Duration(rand.Intn(150)+50) * time.Millisecond)
// Introduce a small chance of failure for demonstration
if rand.Intn(100) < 3 { // 3% failure rate
w.WriteHeader(http.StatusInternalServerError)
fmt.Fprintf(w, "Internal Server Error from version %s", version)
httpRequestsTotal.With(prometheus.Labels{"code": "500", "method": "GET"}).Inc()
return
}
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, "Hello from version %s", version)
httpRequestsTotal.With(prometheus.Labels{"code": "200", "method": "GET"}).Inc()
}
func main() {
// For v2.0.0, we'll change this to a 0.5% failure rate to pass the canary
// if version == "v2.0.0" { failure_rate = 0.5 }
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", handler)
log.Printf("Starting checkout-service version %s on :8080", version)
log.Fatal(http.ListenAndServe(":8080", nil))
}
The `Rollout` Manifest
Now, we define the Rollout resource. This will replace our standard Deployment object. It looks similar but includes a strategy block that defines the canary process.
# rollout-checkout-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout-service
namespace: production
spec:
replicas: 5
selector:
matchLabels:
app: checkout-service
template:
metadata:
labels:
app: checkout-service
spec:
containers:
- name: checkout-service
# Start with the stable version
image: my-repo/checkout-service:v1.0.0
ports:
- containerPort: 8080
strategy:
canary:
# Reference to the Service that will be manipulated to split traffic
canaryService: checkout-service-canary
stableService: checkout-service-stable
steps:
# 1. Send 10% of traffic to the new version
- setWeight: 10
      # 2. Run the analysis inline; this step blocks until the AnalysisRun completes
      - analysis:
templates:
- templateName: success-rate-check
args:
- name: service-name
value: checkout-service-canary # Analyze the canary service specifically
- name: namespace
value: production
# 3. If analysis succeeds, ramp up to 50%
- setWeight: 50
- pause: { duration: 30s } # Brief pause for stabilization
      # 4. After the final step, Argo Rollouts promotes the canary to 100% automatically
We also need the two Service objects referenced above. Argo Rollouts directs traffic by manipulating their label selectors: it injects a rollouts-pod-template-hash selector so the stable and canary Services each target only their own ReplicaSet's pods, while the user-facing Service (or ingress/mesh route) continues to span both.
# services.yaml
apiVersion: v1
kind: Service
metadata:
name: checkout-service-stable # The stable service
namespace: production
spec:
ports:
- port: 80
targetPort: 8080
protocol: TCP
selector:
app: checkout-service
# Argo Rollouts will inject this hash to point to the stable ReplicaSet
# rollouts-pod-template-hash: <stable-hash>
---
apiVersion: v1
kind: Service
metadata:
name: checkout-service-canary # The canary service
namespace: production
spec:
ports:
- port: 80
targetPort: 8080
protocol: TCP
selector:
app: checkout-service
# Argo Rollouts will inject this hash to point to the canary ReplicaSet
# rollouts-pod-template-hash: <canary-hash>
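One assumption baked into the queries above is that Prometheus scrapes the stable and canary Services as separate jobs (job="checkout-service-stable" and job="checkout-service-canary"). If you run the Prometheus Operator / kube-prometheus-stack, that happens naturally because the operator sets the job label to the Service name by default. A minimal ServiceMonitor sketch, assuming both Services carry an app: checkout-service label it can select:
# servicemonitor.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-service
  namespace: production
  labels:
    release: kube-prometheus-stack # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: checkout-service # assumed label on both the stable and canary Services
  # Copy the Argo Rollouts pod label onto the scraped series; Prometheus exposes it
  # as rollouts_pod_template_hash, which the later latency queries rely on.
  podTargetLabels:
  - rollouts-pod-template-hash
  endpoints:
  - targetPort: 8080
    path: /metrics
    interval: 15s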
Execution Flow and Analysis
The flow proceeds as follows:
* Initial state: The Rollout checkout-service is running v1.0.0. All 5 replicas belong to the stable ReplicaSet, the checkout-service-stable Service points to these 5 pods, and checkout-service-canary points to nothing.
* Trigger: We update the Rollout's image to my-repo/checkout-service:v2.0.0 (e.g., via kubectl patch or a GitOps commit).
* Step 1 (setWeight: 10): Argo Rollouts creates a new ReplicaSet for v2.0.0. It scales the stable ReplicaSet down to 4 replicas and the new canary ReplicaSet up to 1 replica, then updates checkout-service-canary to select the new v2.0.0 pod. If you are using a service mesh like Istio or Linkerd via trafficRouting, setWeight manipulates route weights instead of replica counts, which is a much more precise way of splitting traffic.
* Step 2 (the analysis step): The rollout blocks while Argo Rollouts creates an AnalysisRun from our success-rate-check template. The controller begins the measurement loop: every 30 seconds, it queries Prometheus with sum(rate(http_requests_total{job="checkout-service-canary", code=~"2..", ...}[2m])) / sum(rate(http_requests_total{job="checkout-service-canary", ...}[2m])).
* Success Case: The new v2.0.0 has a 0.5% failure rate. The query consistently returns ~0.995. Since 0.995 >= 0.99, each measurement is successful. After 5 successful measurements, the AnalysisRun is marked Successful.
* Failure Case: Imagine v2.0.0 has a bug causing a 4% failure rate. The query returns ~0.96. Since 0.96 < 0.99, the measurement is a failure. Once the failed-measurement count exceeds failureLimit: 2, the AnalysisRun is marked Failed. The entire Rollout is aborted and automatically rolled back: the canary ReplicaSet is scaled to zero, and the stable ReplicaSet is scaled back up to 5 replicas.
* Promotion: If every step succeeds, v2.0.0 becomes the new stable version, all traffic shifts to it, and the old v1.0.0 ReplicaSet is scaled down and its pods are terminated.
Scenario 2: Advanced Latency and Baseline Comparison
Success rate is a good start, but performance degradation is often more subtle. A common SLO is to ensure the new version's latency is not significantly worse than the stable version's. This requires a more advanced AnalysisTemplate that performs a comparative analysis.
Our new SLO: "The p99 latency of the canary must not be more than 20% higher than the p99 latency of the stable version."
Instrumenting for Latency Histograms
First, our application needs to export histogram metrics for request duration.
// main.go - additions
var (
httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests.",
Buckets: prometheus.DefBuckets, // Default buckets: .005, .01, .025, ...
}, []string{"version"})
)
func instrumentedHandler(h http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
h.ServeHTTP(w, r)
duration := time.Since(start).Seconds()
httpRequestDuration.With(prometheus.Labels{"version": version}).Observe(duration)
})
}
func main() {
// ... existing code ...
	// Replace the earlier http.HandleFunc("/", handler) registration with this
	// instrumented version; registering the "/" pattern twice would panic.
	mainHandler := http.HandlerFunc(handler)
	http.Handle("/", instrumentedHandler(mainHandler))
// ... existing code ...
}
The Comparative `AnalysisTemplate`
This template is more complex. It will query the latency for both canary and stable pods and compare them. Argo Rollouts provides metadata about the stable and canary ReplicaSet pod hashes, which we can use to construct precise PromQL queries.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: latency-baseline-check
spec:
args:
- name: stable-hash
- name: canary-hash
- name: service-name
metrics:
- name: latency-p99-comparison
    # The canary's p99 may be at most 20% higher than the stable's
    successCondition: result[0] <= 1.2
failureLimit: 1
count: 3
interval: 1m
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.canary-hash}}"}[2m])) by (le))
          /
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.stable-hash}}"}[2m])) by (le))
Critical Differences:
* args: We now require stable-hash and canary-hash in addition to service-name. These are supplied by the Rollout object itself.
* query: A single PromQL expression that divides the canary's p99 latency by the stable's. The Prometheus provider returns the query result as a vector of values, so the baseline comparison has to be expressed inside the query itself.
* successCondition: result[0] <= 1.2. The ratio of canary p99 to stable p99 must not exceed 1.2, i.e., the canary may be at most 20% slower than the baseline.
* rollouts_pod_template_hash="{{args.canary-hash}}" selects metrics from the canary pods specifically. Argo Rollouts automatically adds the rollouts-pod-template-hash label to pods it manages, making this type of precise targeting possible.
Updating the `Rollout`
We modify our Rollout to use this new template. Note the args section, which uses valueFrom to dynamically get the pod template hashes.
# rollout-checkout-service.yaml (updated analysis step)
# ... previous spec ...
steps:
- setWeight: 10
  - analysis:
templates:
- templateName: success-rate-check
- templateName: latency-baseline-check
args:
      # This arg is shared by both templates
      - name: service-name
        value: checkout-service-canary
      # Required by success-rate-check
      - name: namespace
        value: production
# Args for latency-baseline-check
- name: stable-hash
valueFrom:
podTemplateHashValue: Stable
- name: canary-hash
valueFrom:
          podTemplateHashValue: Latest # "Latest" resolves to the canary ReplicaSet's hash
Now, during the analysis step, Argo Rollouts merges both templates into a single AnalysisRun and evaluates their metrics in parallel. The rollout only proceeds if both the success rate check and the latency baseline check pass, which makes for a much higher-confidence release process.
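An inline analysis step like this gates promotion at a single point. Argo Rollouts can also run a background analysis for the whole duration of the canary, aborting the rollout the moment it fails. A minimal sketch (for background use you would typically drop count from the template so it keeps measuring until the rollout finishes):
strategy:
  canary:
    canaryService: checkout-service-canary
    stableService: checkout-service-stable
    # Background analysis: evaluated continuously while the steps below execute
    analysis:
      templates:
      - templateName: success-rate-check
      startingStep: 1 # wait until traffic has actually shifted before measuring
      args:
      - name: service-name
        value: checkout-service-canary
      - name: namespace
        value: production
    steps:
    - setWeight: 10
    - pause: { duration: 5m }
    - setWeight: 50
    - pause: { duration: 5m }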
Advanced Edge Cases and Production Patterns
Implementing the above scenarios will put you on the right path, but production systems are messy. Here are common edge cases and how to address them.
1. The Low-Traffic Problem: Statistical Insignificance
Problem: Your canary analysis runs at 3 AM when traffic is low. The canary gets only 10 requests. 9 are successful. The success rate is 90%, failing the analysis, but the sample size is too small to be meaningful.
Solution: Build a minimum traffic threshold into your PromQL query. Modify the success condition to handle cases where the query returns no data or an empty result.
# AnalysisTemplate with traffic threshold
# ...
      query: |
        (
          sum(rate(http_requests_total{job="{{args.service-name}}", code=~"2.."}[2m]))
          /
          sum(rate(http_requests_total{job="{{args.service-name}}"}[2m]))
        )
        and
        (sum(rate(http_requests_total{job="{{args.service-name}}"}[2m])) * 120 > 100)
PromQL has no variable bindings, so the total-request expression appears twice. The query calculates the success rate and uses the and operator to suppress the result when traffic is too low: the rate multiplied by 120 approximates the number of requests in the 2-minute window, and if that is not above 100, the right-hand side matches nothing and the whole query returns an empty result. We then need to handle this in our successCondition.
# ...
successCondition: "len(result) == 0 or result[0] >= 0.99"
# ...
The condition len(result) == 0 treats an empty result (due to low traffic) as a success, effectively skipping the analysis for that interval. This is a conscious choice: we'd rather proceed cautiously on low traffic than fail a rollout due to statistical noise.
2. Handling Metric Flakiness and Inconclusive States
Problem: Prometheus is briefly unavailable or a network glitch prevents the Argo Rollouts controller from scraping the metric. The measurement fails, contributing to the failureLimit.
Solution: Lean on consecutiveErrorLimit. When the provider itself returns an error (e.g., an HTTP 503 from Prometheus), the measurement is recorded as an error rather than a failure: errors are tracked separately, do not count towards failureLimit, and by default the analysis only gives up after 4 consecutive errored measurements. The related inconclusiveLimit applies to measurements that satisfy neither the successCondition nor the failureCondition.
# ...
metrics:
- name: success-rate
# ...
failureLimit: 2
    consecutiveErrorLimit: 4 # Tolerate up to 4 consecutive transient Prometheus errors (the default)
    inconclusiveLimit: 2 # Allow 2 measurements that satisfy neither condition
# ...
If the inconclusiveLimit is reached, the AnalysisRun is marked Inconclusive and the rollout pauses rather than failing outright. This is useful for alerting a human to investigate the underlying metrics infrastructure before deciding to promote or abort.
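You can also produce Inconclusive results deliberately by defining both a successCondition and a failureCondition with a gap between them; a measurement that lands in the gap satisfies neither condition and pauses the rollout for a human decision. A sketch applied to the success-rate metric:
metrics:
- name: success-rate
  successCondition: result[0] >= 0.99 # clearly healthy, keep going
  failureCondition: result[0] < 0.95 # clearly broken, fail and roll back
  # Anything between 0.95 and 0.99 is recorded as Inconclusive
  failureLimit: 2
  inconclusiveLimit: 2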
3. The Initial Ramp-Up Latency Penalty
Problem: A JVM-based service has a JIT warmup period. The first 60 seconds of its life are significantly slower. Your latency analysis runs immediately, compares the cold canary to the warm stable pods, and fails the deployment.
Solution: Introduce an initial delay before the analysis begins. This can be done by adding a separate pause step before the analysis step.
# ...
steps:
- setWeight: 10
- pause: { duration: 90s } # Warm-up pause
  - analysis: # Analysis step; blocks until the AnalysisRun completes
# ...
This gives the canary pod(s) 90 seconds to warm up, complete JIT compilation, fill caches, and establish connection pools before its performance is critically evaluated.
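The same effect is available inside the AnalysisTemplate itself via the metric-level initialDelay field, which postpones only that metric's first measurement; a brief sketch against the latency metric:
metrics:
- name: latency-p99-comparison
  initialDelay: 90s # skip the JIT warm-up window before the first measurement
  interval: 1m
  count: 3
  # ...provider and successCondition as before...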
4. Performance Considerations: Query Cost
Problem: You have hundreds of services, each with its own Rollout performing analysis. The p99 latency queries, which scan histogram buckets, are resource-intensive on Prometheus.
Solution: Use Prometheus Recording Rules. A recording rule pre-calculates expensive queries and stores the result as a new time series. Your AnalysisTemplate can then query this much cheaper, pre-aggregated metric.
Prometheus Rule (prometheus-rules.yaml):
groups:
- name: service.rules
rules:
- record: job:http_request_duration_seconds:p99_2m
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[2m])) by (le, job, namespace, rollouts_pod_template_hash))
This rule creates a new metric called job:http_request_duration_seconds:p99_2m. Now, your AnalysisTemplate can be simplified dramatically:
Simplified AnalysisTemplate:
# ...
      query: |
        max(job:http_request_duration_seconds:p99_2m{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.canary-hash}}"})
        /
        max(job:http_request_duration_seconds:p99_2m{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.stable-hash}}"})
# ...
This query is vastly more efficient: it no longer performs the histogram_quantile calculation at query time, it simply retrieves and divides the two pre-calculated values. The max() wrappers drop the mismatched rollouts_pod_template_hash labels so the division can match, and the successCondition remains result[0] <= 1.2.
Conclusion: From Deployment to Release Engineering
Adopting metric-driven canary analysis with Argo Rollouts and Prometheus is a significant step in maturing a team's release engineering practices. It shifts the responsibility of release verification from humans to an automated, data-driven system that rigorously enforces SLOs. This process systematically reduces the risk of deploying changes that negatively impact users.
By moving beyond simple success rate checks to comparative baseline analysis and by proactively handling edge cases like low traffic and metric flakiness, you can build a highly resilient and reliable deployment pipeline. This automation frees up senior engineers from the toil of manual release verification, allowing them to focus on building features, knowing that a robust safety net is in place to catch performance regressions before they become production incidents.