Canary Analysis with Argo Rollouts and Prometheus Metrics
Beyond Rolling Updates: Data-Driven Deployments
Standard Kubernetes RollingUpdate strategies are a significant improvement over manual deployments, but they operate on a simplistic health check model: is the pod running and responding to a liveness probe? This binary check is insufficient for complex, high-traffic services where subtle performance degradations—a 50ms increase in p99 latency or a 0.5% dip in success rate—can have a substantial business impact. For senior engineers responsible for service reliability, deploying with hope is not a strategy.
This is where progressive delivery, implemented with a tool like Argo Rollouts, becomes a critical practice. Argo Rollouts extends Kubernetes by providing advanced deployment strategies like Blue/Green and Canary. However, its true power lies in its ability to automate the analysis phase of these deployments. Instead of a human manually checking dashboards for 15 minutes, Argo Rollouts can query a metrics provider like Prometheus, analyze the results against predefined Service Level Objectives (SLOs), and automatically promote or roll back the release.
This article dives deep into the mechanics of implementing robust, metric-driven canary analysis using Argo Rollouts and Prometheus. We will bypass introductory concepts and focus on production-grade patterns, complex query construction, and handling the inevitable edge cases that arise in real-world systems.
The Core Primitives: `AnalysisTemplate` and `AnalysisRun`
To understand metric-driven analysis in Argo Rollouts, you must first understand its core CRDs: AnalysisTemplate and AnalysisRun. A Rollout object does not contain the analysis logic itself; it references it.
* AnalysisTemplate: A reusable, namespace-scoped template (its cluster-wide counterpart is ClusterAnalysisTemplate) that defines *how* to perform an analysis. It specifies the metrics provider (e.g., Prometheus), the queries to run, and the success/failure conditions.
* AnalysisRun: An instantiation of an AnalysisTemplate for a specific Rollout at a specific point in time. It records the measurements taken, the results of each measurement, and the final outcome (Successful, Failed, Inconclusive). You inspect AnalysisRun objects to debug a failed canary.
The decoupling is intentional. You can define a standard AnalysisTemplate for HTTP success rate and reuse it across dozens of microservices, promoting standardization of SLOs.
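If you want the same template shared across namespaces, Argo Rollouts also offers a cluster-scoped variant. A minimal sketch (the template name here is illustrative):
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: org-wide-success-rate # cluster-scoped, so no namespace
spec:
  # ...same args/metrics structure as a namespaced AnalysisTemplate...
A Rollout references it exactly like a namespaced template, with clusterScope: true added to the entry under analysis.templates.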
Let's examine a foundational AnalysisTemplate for a Prometheus provider. This template will form the basis of our scenarios.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate-check
namespace: production
spec:
args:
# Arguments to be passed in from the Rollout
- name: service-name
- name: namespace
metrics:
- name: success-rate
    # Condition each measurement must satisfy to be counted as successful
    successCondition: result[0] >= 0.99
    # Maximum number of failed measurements tolerated before the analysis fails
    failureLimit: 2
# How many measurements to take
count: 5
# How long to wait between measurements
interval: 30s
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
query: |
sum(rate(http_requests_total{job="{{args.service-name}}", namespace="{{args.namespace}}", code=~"2.."}[2m]))
/
sum(rate(http_requests_total{job="{{args.service-name}}", namespace="{{args.namespace}}"}[2m]))
Key fields dissected:
* spec.args: These are variables that the Rollout will supply. This makes the template reusable. Here, we parameterize the service-name and namespace to use in our PromQL query.
* successCondition: This is the heart of the analysis. It is an expression (evaluated with the expr library) run against the result of the query. Here, we expect the query to return a single value (result[0]), which must be greater than or equal to 0.99.
* failureLimit: The analysis run fails once the number of failed measurements exceeds this limit; the count is cumulative, not consecutive. A failed measurement is one where the successCondition evaluates to false.
* count & interval: Defines the analysis duration. In this case, the query runs 5 times with 30 seconds between measurements, so the analysis step lasts roughly two minutes (the first measurement is taken immediately).
* provider.prometheus.query: The PromQL query. Note the use of {{args.service-name}} to substitute the arguments passed from the Rollout. This query calculates the ratio of 2xx responses to total responses over a 2-minute window.
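When an analysis misbehaves, the AnalysisRun created from this template is the first thing to inspect (kubectl get analysisrun -n production -o yaml). The exact fields vary by Argo Rollouts version, but a trimmed, illustrative status for a failed run might look like this:
status:
  phase: Failed
  metricResults:
  - name: success-rate
    phase: Failed
    successful: 2
    failed: 3 # exceeded failureLimit: 2
    measurements:
    - phase: Successful
      value: "[0.9942]"
    - phase: Failed
      value: "[0.9581]"
    # ...remaining measurements elided...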
Scenario 1: A Production-Grade HTTP Success Rate Canary
Let's implement a full canary deployment for a hypothetical checkout-service. We'll deploy a new version, shift 10% of traffic to it, and then run an automated analysis against our 99% success rate SLO before proceeding.
The Application and its Metrics
First, we need an application that exposes Prometheus metrics. Below is a simple Go service that exports an http_requests_total counter via the official Prometheus client library, which is standard practice.
// main.go
package main
import (
"fmt"
"log"
"math/rand"
"net/http"
"strconv"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
version = "v1.0.0" // This will be injected at build time
httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests.",
}, []string{"code", "method"})
)
func handler(w http.ResponseWriter, r *http.Request) {
// Simulate some work
time.Sleep(time.Duration(rand.Intn(150)+50) * time.Millisecond)
// Introduce a small chance of failure for demonstration
if rand.Intn(100) < 3 { // 3% failure rate
w.WriteHeader(http.StatusInternalServerError)
fmt.Fprintf(w, "Internal Server Error from version %s", version)
httpRequestsTotal.With(prometheus.Labels{"code": "500", "method": "GET"}).Inc()
return
}
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, "Hello from version %s", version)
httpRequestsTotal.With(prometheus.Labels{"code": "200", "method": "GET"}).Inc()
}
func main() {
// For v2.0.0, we'll change this to a 0.5% failure rate to pass the canary
// if version == "v2.0.0" { failure_rate = 0.5 }
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", handler)
log.Printf("Starting checkout-service version %s on :8080", version)
log.Fatal(http.ListenAndServe(":8080", nil))
}
The `Rollout` Manifest
Now, we define the Rollout resource. This will replace our standard Deployment object. It looks similar but includes a strategy block that defines the canary process.
# rollout-checkout-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout-service
namespace: production
spec:
replicas: 5
selector:
matchLabels:
app: checkout-service
template:
metadata:
labels:
app: checkout-service
spec:
containers:
- name: checkout-service
# Start with the stable version
image: my-repo/checkout-service:v1.0.0
ports:
- containerPort: 8080
strategy:
canary:
# Reference to the Service that will be manipulated to split traffic
canaryService: checkout-service-canary
stableService: checkout-service-stable
steps:
# 1. Send 10% of traffic to the new version
- setWeight: 10
      # 2. Run the analysis inline; this step blocks until the AnalysisRun completes
      - analysis:
templates:
- templateName: success-rate-check
args:
- name: service-name
value: checkout-service-canary # Analyze the canary service specifically
- name: namespace
value: production
# 3. If analysis succeeds, ramp up to 50%
- setWeight: 50
- pause: { duration: 30s } # Brief pause for stabilization
      # 4. After the final step, Argo Rollouts promotes the canary to 100% automatically
We also need the two Service objects referenced above. Argo Rollouts directs traffic by manipulating their label selectors: it injects a rollouts-pod-template-hash selector so the stable and canary Services each target only their own ReplicaSet's pods, while the user-facing Service (or ingress/mesh route) continues to span both.
# services.yaml
apiVersion: v1
kind: Service
metadata:
name: checkout-service-stable # The stable service
namespace: production
spec:
ports:
- port: 80
targetPort: 8080
protocol: TCP
selector:
app: checkout-service
# Argo Rollouts will inject this hash to point to the stable ReplicaSet
# rollouts-pod-template-hash: <stable-hash>
---
apiVersion: v1
kind: Service
metadata:
name: checkout-service-canary # The canary service
namespace: production
spec:
ports:
- port: 80
targetPort: 8080
protocol: TCP
selector:
app: checkout-service
# Argo Rollouts will inject this hash to point to the canary ReplicaSet
# rollouts-pod-template-hash: <canary-hash>
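One assumption baked into the queries above is that Prometheus scrapes the stable and canary Services as separate jobs (job="checkout-service-stable" and job="checkout-service-canary"). If you run the Prometheus Operator / kube-prometheus-stack, that happens naturally because the operator sets the job label to the Service name by default. A minimal ServiceMonitor sketch, assuming both Services carry an app: checkout-service label it can select:
# servicemonitor.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-service
  namespace: production
  labels:
    release: kube-prometheus-stack # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: checkout-service # assumed label on both the stable and canary Services
  # Copy the Argo Rollouts pod label onto the scraped series; Prometheus exposes it
  # as rollouts_pod_template_hash, which the later latency queries rely on.
  podTargetLabels:
  - rollouts-pod-template-hash
  endpoints:
  - targetPort: 8080
    path: /metrics
    interval: 15s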
Execution Flow and Analysis
The flow proceeds as follows:
* Initial state: The Rollout checkout-service is running v1.0.0. All 5 replicas belong to the stable ReplicaSet, the checkout-service-stable Service points to these 5 pods, and checkout-service-canary points to nothing.
* Trigger: We update the Rollout's image to my-repo/checkout-service:v2.0.0 (e.g., via kubectl patch or a GitOps commit).
* Step 1 (setWeight: 10): Argo Rollouts creates a new ReplicaSet for v2.0.0. It scales the stable ReplicaSet down to 4 replicas and the new canary ReplicaSet up to 1 replica, then updates checkout-service-canary to select the new v2.0.0 pod. If you are using a service mesh like Istio or Linkerd via trafficRouting, setWeight manipulates route weights instead of replica counts, which is a much more precise way of splitting traffic.
* Step 2 (the analysis step): The rollout blocks while Argo Rollouts creates an AnalysisRun from our success-rate-check template. The controller begins the measurement loop: every 30 seconds, it queries Prometheus with sum(rate(http_requests_total{job="checkout-service-canary", code=~"2..", ...}[2m])) / sum(rate(http_requests_total{job="checkout-service-canary", ...}[2m])).
* Success Case: The new v2.0.0 has a 0.5% failure rate. The query consistently returns ~0.995. Since 0.995 >= 0.99, each measurement is successful. After 5 successful measurements, the AnalysisRun is marked Successful.
* Failure Case: Imagine v2.0.0 has a bug causing a 4% failure rate. The query returns ~0.96. Since 0.96 < 0.99, the measurement is a failure. Once the failed-measurement count exceeds failureLimit: 2, the AnalysisRun is marked Failed. The entire Rollout is aborted and automatically rolled back: the canary ReplicaSet is scaled to zero, and the stable ReplicaSet is scaled back up to 5 replicas.
* Promotion: If every step succeeds, v2.0.0 becomes the new stable version, all traffic shifts to it, and the old v1.0.0 ReplicaSet is scaled down and its pods are terminated.
Scenario 2: Advanced Latency and Baseline Comparison
Success rate is a good start, but performance degradation is often more subtle. A common SLO is to ensure the new version's latency is not significantly worse than the stable version's. This requires a more advanced AnalysisTemplate that performs a comparative analysis.
Our new SLO: "The p99 latency of the canary must not be more than 20% higher than the p99 latency of the stable version."
Instrumenting for Latency Histograms
First, our application needs to export histogram metrics for request duration.
// main.go - additions
var (
httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests.",
Buckets: prometheus.DefBuckets, // Default buckets: .005, .01, .025, ...
}, []string{"version"})
)
func instrumentedHandler(h http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
h.ServeHTTP(w, r)
duration := time.Since(start).Seconds()
httpRequestDuration.With(prometheus.Labels{"version": version}).Observe(duration)
})
}
func main() {
// ... existing code ...
	// Replace the earlier http.HandleFunc("/", handler) registration with this
	// instrumented version; registering the "/" pattern twice would panic.
	mainHandler := http.HandlerFunc(handler)
	http.Handle("/", instrumentedHandler(mainHandler))
// ... existing code ...
}
The Comparative `AnalysisTemplate`
This template is more complex. It will query the latency for both canary and stable pods and compare them. Argo Rollouts provides metadata about the stable and canary ReplicaSet pod hashes, which we can use to construct precise PromQL queries.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: latency-baseline-check
spec:
args:
- name: stable-hash
- name: canary-hash
- name: service-name
metrics:
- name: latency-p99-comparison
    # The canary's p99 may be at most 20% higher than the stable's
    successCondition: result[0] <= 1.2
failureLimit: 1
count: 3
interval: 1m
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.canary-hash}}"}[2m])) by (le))
          /
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.stable-hash}}"}[2m])) by (le))
Critical Differences:
* args: We now require stable-hash and canary-hash in addition to service-name. These are supplied by the Rollout object itself.
* query: A single PromQL expression that divides the canary's p99 latency by the stable's. The Prometheus provider returns the query result as a vector of values, so the baseline comparison has to be expressed inside the query itself.
* successCondition: result[0] <= 1.2. The ratio of canary p99 to stable p99 must not exceed 1.2, i.e., the canary may be at most 20% slower than the baseline.
* rollouts_pod_template_hash="{{args.canary-hash}}" selects metrics from the canary pods specifically. Argo Rollouts automatically adds the rollouts-pod-template-hash label to pods it manages, making this type of precise targeting possible.
Updating the `Rollout`
We modify our Rollout to use this new template. Note the args section, which uses valueFrom to dynamically get the pod template hashes.
# rollout-checkout-service.yaml (updated analysis step)
# ... previous spec ...
steps:
- setWeight: 10
  - analysis:
templates:
- templateName: success-rate-check
- templateName: latency-baseline-check
args:
      # This arg is shared by both templates
      - name: service-name
        value: checkout-service-canary
      # Required by success-rate-check
      - name: namespace
        value: production
# Args for latency-baseline-check
- name: stable-hash
valueFrom:
podTemplateHashValue: Stable
- name: canary-hash
valueFrom:
          podTemplateHashValue: Latest # "Latest" resolves to the canary ReplicaSet's hash
Now, during the analysis step, Argo Rollouts merges both templates into a single AnalysisRun and evaluates their metrics in parallel. The rollout only proceeds if both the success rate check and the latency baseline check pass, which makes for a much higher-confidence release process.
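An inline analysis step like this gates promotion at a single point. Argo Rollouts can also run a background analysis for the whole duration of the canary, aborting the rollout the moment it fails. A minimal sketch (for background use you would typically drop count from the template so it keeps measuring until the rollout finishes):
strategy:
  canary:
    canaryService: checkout-service-canary
    stableService: checkout-service-stable
    # Background analysis: evaluated continuously while the steps below execute
    analysis:
      templates:
      - templateName: success-rate-check
      startingStep: 1 # wait until traffic has actually shifted before measuring
      args:
      - name: service-name
        value: checkout-service-canary
      - name: namespace
        value: production
    steps:
    - setWeight: 10
    - pause: { duration: 5m }
    - setWeight: 50
    - pause: { duration: 5m }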
Advanced Edge Cases and Production Patterns
Implementing the above scenarios will put you on the right path, but production systems are messy. Here are common edge cases and how to address them.
1. The Low-Traffic Problem: Statistical Insignificance
Problem: Your canary analysis runs at 3 AM when traffic is low. The canary gets only 10 requests. 9 are successful. The success rate is 90%, failing the analysis, but the sample size is too small to be meaningful.
Solution: Build a minimum traffic threshold into your PromQL query. Modify the success condition to handle cases where the query returns no data or an empty result.
# AnalysisTemplate with traffic threshold
# ...
      query: |
        (
          sum(rate(http_requests_total{job="{{args.service-name}}", code=~"2.."}[2m]))
          /
          sum(rate(http_requests_total{job="{{args.service-name}}"}[2m]))
        )
        and
        (sum(rate(http_requests_total{job="{{args.service-name}}"}[2m])) * 120 > 100)
PromQL has no variable bindings, so the total-request expression appears twice. The query calculates the success rate and uses the and operator to suppress the result when traffic is too low: the rate multiplied by 120 approximates the number of requests in the 2-minute window, and if that is not above 100, the right-hand side matches nothing and the whole query returns an empty result. We then need to handle this in our successCondition.
# ...
successCondition: "len(result) == 0 or result[0] >= 0.99"
# ...
The condition len(result) == 0 treats an empty result (due to low traffic) as a success, effectively skipping the analysis for that interval. This is a conscious choice: we'd rather proceed cautiously on low traffic than fail a rollout due to statistical noise.
2. Handling Metric Flakiness and Inconclusive States
Problem: Prometheus is briefly unavailable or a network glitch prevents the Argo Rollouts controller from scraping the metric. The measurement fails, contributing to the failureLimit.
Solution: Lean on consecutiveErrorLimit. When the provider itself returns an error (e.g., an HTTP 503 from Prometheus), the measurement is recorded as an error rather than a failure: errors are tracked separately, do not count towards failureLimit, and by default the analysis only gives up after 4 consecutive errored measurements. The related inconclusiveLimit applies to measurements that satisfy neither the successCondition nor the failureCondition.
# ...
metrics:
- name: success-rate
# ...
failureLimit: 2
    consecutiveErrorLimit: 4 # Tolerate up to 4 consecutive transient Prometheus errors (the default)
    inconclusiveLimit: 2 # Allow 2 measurements that satisfy neither condition
# ...
If the inconclusiveLimit is reached, the AnalysisRun is marked Inconclusive and the rollout pauses rather than failing outright. This is useful for alerting a human to investigate the underlying metrics infrastructure before deciding to promote or abort.
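You can also produce Inconclusive results deliberately by defining both a successCondition and a failureCondition with a gap between them; a measurement that lands in the gap satisfies neither condition and pauses the rollout for a human decision. A sketch applied to the success-rate metric:
metrics:
- name: success-rate
  successCondition: result[0] >= 0.99 # clearly healthy, keep going
  failureCondition: result[0] < 0.95 # clearly broken, fail and roll back
  # Anything between 0.95 and 0.99 is recorded as Inconclusive
  failureLimit: 2
  inconclusiveLimit: 2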
3. The Initial Ramp-Up Latency Penalty
Problem: A JVM-based service has a JIT warmup period. The first 60 seconds of its life are significantly slower. Your latency analysis runs immediately, compares the cold canary to the warm stable pods, and fails the deployment.
Solution: Introduce an initial delay before the analysis begins. This can be done by adding a separate pause step before the analysis step.
# ...
steps:
- setWeight: 10
- pause: { duration: 90s } # Warm-up pause
  - analysis: # Analysis step; blocks until the AnalysisRun completes
# ...
This gives the canary pod(s) 90 seconds to warm up, complete JIT compilation, fill caches, and establish connection pools before its performance is critically evaluated.
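The same effect is available inside the AnalysisTemplate itself via the metric-level initialDelay field, which postpones only that metric's first measurement; a brief sketch against the latency metric:
metrics:
- name: latency-p99-comparison
  initialDelay: 90s # skip the JIT warm-up window before the first measurement
  interval: 1m
  count: 3
  # ...provider and successCondition as before...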
4. Performance Considerations: Query Cost
Problem: You have hundreds of services, each with its own Rollout performing analysis. The p99 latency queries, which scan histogram buckets, are resource-intensive on Prometheus.
Solution: Use Prometheus Recording Rules. A recording rule pre-calculates expensive queries and stores the result as a new time series. Your AnalysisTemplate can then query this much cheaper, pre-aggregated metric.
Prometheus Rule (prometheus-rules.yaml):
groups:
- name: service.rules
rules:
- record: job:http_request_duration_seconds:p99_2m
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[2m])) by (le, job, namespace, rollouts_pod_template_hash))
This rule creates a new metric called job:http_request_duration_seconds:p99_2m. Now, your AnalysisTemplate can be simplified dramatically:
Simplified AnalysisTemplate:
# ...
      query: |
        max(job:http_request_duration_seconds:p99_2m{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.canary-hash}}"})
        /
        max(job:http_request_duration_seconds:p99_2m{job="{{args.service-name}}", rollouts_pod_template_hash="{{args.stable-hash}}"})
# ...
This query is vastly more efficient: it no longer performs the histogram_quantile calculation at query time, it simply retrieves and divides the two pre-calculated values. The max() wrappers drop the mismatched rollouts_pod_template_hash labels so the division can match, and the successCondition remains result[0] <= 1.2.
Conclusion: From Deployment to Release Engineering
Adopting metric-driven canary analysis with Argo Rollouts and Prometheus is a significant step in maturing a team's release engineering practices. It shifts the responsibility of release verification from humans to an automated, data-driven system that rigorously enforces SLOs. This process systematically reduces the risk of deploying changes that negatively impact users.
By moving beyond simple success rate checks to comparative baseline analysis and by proactively handling edge cases like low traffic and metric flakiness, you can build a highly resilient and reliable deployment pipeline. This automation frees up senior engineers from the toil of manual release verification, allowing them to focus on building features, knowing that a robust safety net is in place to catch performance regressions before they become production incidents.