Argo Rollouts: Advanced Canary Analysis with Prometheus Metrics
Beyond `sleep`: Metric-Driven Progressive Delivery
For any team operating at scale, the standard Kubernetes RollingUpdate strategy presents an unacceptable risk. A faulty deployment can quickly propagate across all pods, leading to widespread outages. While canary deployments are a conceptual improvement, naive implementations often rely on a sleep duration—a period of hopeful observation—before manual promotion. This is insufficient. We are not fortune tellers; we are engineers. We need to replace hope with data.
This is where automated, metric-driven analysis becomes non-negotiable. The goal is to create a system where a deployment's promotion is contingent upon the canary version proving its stability against key Service Level Indicators (SLIs) like error rate and latency. Argo Rollouts provides the powerful orchestration layer, but the intelligence lies in crafting precise and resilient analysis rules against a monitoring backend like Prometheus.
This post is not an introduction to Argo Rollouts. It assumes you are familiar with the Rollout CRD and its basic canary strategy. We will focus exclusively on the advanced application of the AnalysisTemplate and AnalysisRun CRDs to build a production-ready, automated canary promotion gate using live Prometheus metrics.
We will construct a multi-stage analysis process that:
- Validates a basic success rate against a static threshold.
- Incorporates multiple, simultaneous metric checks (e.g., success rate and p95 latency).
- Handles the critical edge case of a canary receiving no traffic.
- Implements the most robust pattern: comparing the canary's performance directly against the stable version's baseline, making the analysis adaptive to current system conditions.
The Anatomy of a Metric-Driven Rollout
Before diving into complex templates, let's establish our components. A metric-driven canary with Argo Rollouts and Prometheus relies on a tight integration between these parts:
* Prometheus: Scrapes metrics from the application pods; a ServiceMonitor CRD is typically used to configure this scraping.
* Rollout CRD: Defines the deployment strategy, including the steps for shifting traffic (setWeight) and pausing for analysis.
* AnalysisTemplate CRD: A reusable template defining what to measure and what constitutes success. This is where we'll embed our PromQL queries.
* AnalysisRun CRD: An instantiation of an AnalysisTemplate for a specific rollout, which records the measurements and determines the outcome (Successful, Failed, Inconclusive).
Our focus will be on crafting sophisticated AnalysisTemplate resources that move from trivial checks to robust, baseline-aware validations.
Setting the Stage: A Sample Service and its Metrics
Let's assume we have a simple Go microservice, checkout-service, which exposes the following Prometheus metrics via its /metrics endpoint:
* http_requests_total: A counter with labels method, path, and status_code.
* http_request_duration_seconds: A histogram with label path.
To make these metrics available to Prometheus, we define a Service and a ServiceMonitor:
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: checkout-service
labels:
app: checkout-service
spec:
ports:
- port: 80
targetPort: 8080
protocol: TCP
name: http
selector:
app: checkout-service
---
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: checkout-service-monitor
labels:
release: prometheus # Assumes you use the prometheus-community helm chart
spec:
selector:
matchLabels:
app: checkout-service
endpoints:
- port: http
With this in place, Prometheus will begin scraping our application pods, providing the data needed for analysis.
Phase 1: Simple Success Rate Analysis
Our first iteration will be a basic check: the canary's HTTP success rate must be above 99%. This involves creating an AnalysisTemplate with a single PromQL query.
The `AnalysisTemplate`
This template defines a reusable check for success rate. It's parameterized using args to allow for flexibility.
# analysis-template-success-rate.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate-check
spec:
args:
- name: service-name
metrics:
- name: success-rate
# Tolerate up to 3 failed measurements before the metric, and with it the analysis, is marked Failed.
# This accounts for transient issues with Prometheus scraping or query evaluation.
failureLimit: 3
# Take 5 measurements in total, one every 'interval'.
count: 5
interval: 20s
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
query: |
sum(rate(http_requests_total{
service="{{args.service-name}}",
status_code=~"^2.."
}[1m]))
/
sum(rate(http_requests_total{
service="{{args.service-name}}"
}[1m]))
# The success condition defines what a 'passing' result looks like.
# 'result[0]' refers to the first (and only) result from the query.
successCondition: result[0] > 0.99
Dissecting the Query:
* sum(rate(http_requests_total{...}[1m])): This is the standard way to calculate requests-per-second from a cumulative counter. We use rate over a 1-minute window ([1m]) to get a smoothed-out value.
* status_code=~"^2..": This regex selects only successful (2xx) status codes.
* The query calculates the ratio of the rate of successful requests to the rate of all requests.
* {{args.service-name}}: This is how Argo Rollouts injects arguments into the template, making it reusable for different services.
Integrating into the `Rollout`
Now, we modify our Rollout to use this template. The analysis block is added to the canary steps.
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout-service
spec:
replicas: 5
strategy:
canary:
steps:
- setWeight: 20
- pause: { duration: 1m } # Give the canary time to serve traffic and emit metrics
# --- Analysis Step --- #
- analysis:
templates:
- templateName: success-rate-check
args:
- name: service-name
value: checkout-service-canary # We target the canary-specific service
# ... selector, template, etc.
With this configuration, after shifting 20% of traffic to the canary, the rollout pauses briefly to let metrics accumulate, then Argo Rollouts creates an AnalysisRun that executes the success-rate-check template. It queries Prometheus every 20 seconds for a total of 5 measurements. If the success rate stays above 99%, the analysis succeeds and the rollout proceeds to the next step. If more than failureLimit measurements fail, the analysis fails, the rollout is aborted, and all traffic returns to the stable version.
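Note that the analysis args target checkout-service-canary, a canary-specific Service the manifest above elides. Below is a minimal sketch of that wiring, assuming the selector-based canaryService/stableService mechanism (the checkout-service-stable name is illustrative); targets scraped via the earlier ServiceMonitor carry a service label equal to the Service name, which is what the {{args.service-name}} filter matches:
# canary-service.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service-canary
  labels:
    app: checkout-service   # matched by the ServiceMonitor defined earlier
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app: checkout-service   # Argo Rollouts adds the canary pod-template-hash to this selector
---
# rollout.yaml (excerpt) -- under spec.strategy.canary
canaryService: checkout-service-canary   # selector pinned to the canary ReplicaSet
stableService: checkout-service-stable   # analogous Service for the stable ReplicaSet (not shown)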
Phase 2: Multi-Metric Analysis (Latency and Success Rate)
Success rate alone is not a complete picture of health. A service could be responding successfully but be unacceptably slow. We need to analyze latency in parallel.
Let's create a more advanced template that checks both P95 latency and success rate.
# analysis-template-latency-and-success.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: performance-check
spec:
args:
- name: service-name
- name: latency-threshold
value: "0.5" # Default to 500ms
- name: success-rate-threshold
value: "0.99"
metrics:
- name: success-rate
failureLimit: 2
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
query: |
sum(rate(http_requests_total{
  service="{{args.service-name}}",
  status_code=~"^2.."
}[2m]))
/
sum(rate(http_requests_total{
service="{{args.service-name}}"
}[2m]))
successCondition: result[0] >= {{args.success-rate-threshold}}
- name: p95-latency
failureLimit: 2
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
# This query calculates p95 latency from a Prometheus histogram.
query: |
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{
service="{{args.service-name}}"
}[2m])) by (le))
successCondition: result[0] < {{args.latency-threshold}}
Key Improvements:
* Multiple metrics entries: Argo Rollouts evaluates both success-rate and p95-latency during each analysis interval. All metrics must meet their successCondition for the interval to be considered successful.
* histogram_quantile(0.95, ...): The standard PromQL function for calculating quantiles from a histogram metric (http_request_duration_seconds_bucket).
* Parameterized args: We've parameterized the thresholds themselves, allowing different services to specify their own SLOs when using this template.
Your Rollout would now reference this new performance-check template (a sketch follows). This provides a much stronger guarantee of canary health before promotion.
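A minimal sketch of the corresponding analysis step; the threshold overrides are illustrative values, not recommendations:
# rollout.yaml (excerpt, sketch) -- under spec.strategy.canary
steps:
  - setWeight: 20
  - analysis:
      templates:
        - templateName: performance-check
      args:
        - name: service-name
          value: checkout-service-canary
        # Illustrative overrides of the template defaults:
        - name: latency-threshold
          value: "0.3"
        - name: success-rate-threshold
          value: "0.995"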
Edge Case: The "No Traffic" Problem
A subtle but critical failure mode exists in our current setup. What if, due to a service mesh misconfiguration or low overall traffic, the canary service receives zero requests during the analysis window?
* The success rate query will return NaN (division by zero), which fails the > 0.99 check.
* The latency query will return an empty result set.
Depending on how the result surfaces, Argo Rollouts records such measurements as Failed, Errored, or Inconclusive rather than Successful; an AnalysisRun that ends Inconclusive (governed by inconclusiveLimit) pauses the rollout for manual intervention instead of promoting it. These limits can be tuned per metric (see the sketch below), but we can also be more explicit and provide a clearer failure reason.
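For completeness, these limits are plain fields on each metric; a minimal sketch with illustrative values:
# Excerpt of a metric definition with explicit limits (illustrative values)
metrics:
  - name: success-rate
    failureLimit: 3          # failed measurements tolerated before the metric is Failed
    inconclusiveLimit: 2     # inconclusive measurements tolerated before the run is Inconclusive
    consecutiveErrorLimit: 4 # consecutive measurement errors (e.g. Prometheus unreachable) tolerated
    interval: 20s
    count: 5
    # ... provider and successCondition as in the earlier template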
We can add a dedicated metric to ensure a minimum level of traffic is present.
# analysis-template-with-traffic-check.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: robust-performance-check
spec:
args:
# ... same args as before
- name: min-requests-per-second
value: "0.1" # Require at least 1 request every 10 seconds
metrics:
- name: traffic-check
failureLimit: 1
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
query: |
sum(rate(http_requests_total{
service="{{args.service-name}}"
}[2m]))
# Fail fast if traffic is below the minimum threshold.
successCondition: result[0] >= {{args.min-requests-per-second}}
# ... include success-rate and p95-latency metrics from the previous example
The traffic-check metric doesn't gate the others (all metrics in an AnalysisRun are measured in parallel), but it makes the failure mode explicit: if the canary isn't actually being tested under meaningful load, the AnalysisRun will clearly state that the traffic-check metric failed, making debugging much faster.
Phase 3: The Gold Standard - Canary vs. Stable Baseline Analysis
Static thresholds (< 500ms, > 99%) are brittle. System load fluctuates. A background process could increase baseline latency across the entire cluster. In such a scenario, a perfectly healthy canary might fail its analysis simply because its latency is 510ms while the stable version is also at 510ms. The canary isn't worse; the environment has changed.
The most resilient production pattern is to compare the canary's performance directly against the stable version's performance at the same point in time. The success condition becomes relative: "The canary's error rate must not exceed the stable version's error rate by more than a small margin" (we'll use a 2 percentage point margin in the template below).
To achieve this, we need to distinguish metrics from canary pods vs. stable pods. Argo Rollouts makes this possible by injecting a unique rollouts-pod-template-hash label onto every pod it manages. Keep in mind this is a Kubernetes pod label, not a metric label: it has to be propagated onto the scraped series (for example via the ServiceMonitor's podTargetLabels), where it appears as rollouts_pod_template_hash, since Prometheus label names cannot contain hyphens.
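Getting that label onto the scraped series is a prerequisite for the queries below. A minimal sketch, extending the earlier ServiceMonitor and assuming the prometheus-operator's podTargetLabels field is available in your version:
# servicemonitor.yaml (updated excerpt)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-service-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: checkout-service
  # Copy the Argo Rollouts pod label onto every scraped series. Prometheus
  # sanitizes the hyphens, so it surfaces as rollouts_pod_template_hash in PromQL.
  podTargetLabels:
    - rollouts-pod-template-hash
  endpoints:
    - port: http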
The `AnalysisTemplate` for Baseline Comparison
This template is significantly more complex but represents a true production-grade analysis strategy.
# analysis-template-baseline.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: baseline-error-rate-check
spec:
args:
- name: service-name
- name: prometheus-app-label # The label used to select your app pods
value: "app"
# These are supplied from the Rollout's analysis step via valueFrom.podTemplateHashValue (see the Rollout below)
- name: stable-pod-template-hash
- name: canary-pod-template-hash
metrics:
- name: error-rate-comparison
failureLimit: 3
interval: 30s
count: 10
provider:
prometheus:
address: http://prometheus-kube-prometheus-stack-prometheus.monitoring.svc:9090
query: |
  # PromQL has no variable bindings, so the canary and stable error rates are
  # written out inline as (non-2xx request rate) / (total request rate).
  (
    (
      sum(rate(http_requests_total{
        {{args.prometheus-app-label}}="{{args.service-name}}",
        rollouts_pod_template_hash="{{args.canary-pod-template-hash}}",
        status_code!~"^2.."
      }[2m]))
      /
      sum(rate(http_requests_total{
        {{args.prometheus-app-label}}="{{args.service-name}}",
        rollouts_pod_template_hash="{{args.canary-pod-template-hash}}"
      }[2m]))
    )
    > bool
    (
      sum(rate(http_requests_total{
        {{args.prometheus-app-label}}="{{args.service-name}}",
        rollouts_pod_template_hash="{{args.stable-pod-template-hash}}",
        status_code!~"^2.."
      }[2m]))
      /
      sum(rate(http_requests_total{
        {{args.prometheus-app-label}}="{{args.service-name}}",
        rollouts_pod_template_hash="{{args.stable-pod-template-hash}}"
      }[2m]))
      + 0.02
    )
  )
  # Returns 1 if the canary error rate exceeds the stable error rate by more
  # than 2 percentage points, 0 otherwise. 'or on() vector(0)' covers the case
  # where the comparison yields no data (e.g. no errors on the stable side).
  or on() vector(0)
# We succeed if the query result is 0.
successCondition: result[0] == 0
Deconstruction of this Advanced Template:
* args: The {{args.stable-pod-template-hash}} and {{args.canary-pod-template-hash}} values are supplied from the Rollout's analysis step via valueFrom.podTemplateHashValue when the AnalysisRun is created. This is what lets us differentiate canary pods from stable pods.
* Inline error-rate ratios: PromQL has no variable declarations, so the canary and stable error rates are each written out as the rate of non-2xx requests divided by the rate of all requests, scoped to pods with the corresponding rollouts_pod_template_hash value.
* The > bool comparison: The core of the check. It returns 1 if the canary error rate exceeds the stable error rate by more than a 2 percentage point absolute margin, and 0 otherwise.
* or on() vector(0): A crucial piece of PromQL for resiliency. If the comparison on the left of the or produces no data (e.g., the stable version has served no errors, so its error-rate ratio is empty), the or falls back to vector(0). This ensures the analysis doesn't fail simply because the stable version has no errors to compare against.
* successCondition: The query is engineered to return 0 for success and 1 for failure. This makes successCondition: result[0] == 0 clean and unambiguous.
Integrating the Baseline Analysis into the `Rollout`
The Rollout spec is updated to use this new template. The pod-template-hash arguments are passed with valueFrom: podTemplateHashValue, which Argo Rollouts resolves to the stable and canary ReplicaSet hashes when the AnalysisRun is created.
# rollout-baseline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout-service
spec:
replicas: 10
strategy:
canary:
steps:
- setWeight: 10
- pause: { duration: 30s } # Initial pause to allow metrics to propagate
- analysis:
templates:
- templateName: baseline-error-rate-check
args:
- name: service-name
  value: checkout-service
- name: prometheus-app-label
  value: app
- name: stable-pod-template-hash
  valueFrom:
    podTemplateHashValue: Stable
- name: canary-pod-template-hash
  valueFrom:
    podTemplateHashValue: Latest
- setWeight: 50
- pause: { duration: 5m }
# You can run the same analysis again at a higher traffic weight
- analysis:
templates:
- templateName: baseline-error-rate-check
# ... args ...
# ... selector, template, etc.
This setup represents a mature progressive delivery pipeline. Deployments are now validated against the live, real-time performance of the existing version, making them resilient to external environmental factors and providing a very high degree of safety.
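A further refinement is to attach the same template as a background analysis that runs continuously while the rollout progresses, rather than only at discrete steps. The sketch below assumes the baseline-error-rate-check template from above; for a long-lived background check you would typically drop count from the metric so it keeps measuring until the rollout finishes.
# rollout-background.yaml (excerpt, sketch) -- under spec.strategy.canary
analysis:
  templates:
    - templateName: baseline-error-rate-check
  startingStep: 1   # begin once the first setWeight step has run
  args:
    - name: service-name
      value: checkout-service
    - name: prometheus-app-label
      value: app
    - name: stable-pod-template-hash
      valueFrom:
        podTemplateHashValue: Stable
    - name: canary-pod-template-hash
      valueFrom:
        podTemplateHashValue: Latest
steps:
  - setWeight: 10
  - pause: { duration: 5m }
  - setWeight: 50
  - pause: { duration: 10m }
A failing background analysis aborts the rollout at whatever step it has reached, giving continuous protection between the explicit analysis gates.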
Conclusion: From Blind Faith to Data-Driven Confidence
We have journeyed from simple, static thresholding to a dynamic, baseline-aware analysis that embodies the core principles of SRE and modern DevOps. By leveraging Argo Rollouts' AnalysisTemplate CRD with carefully crafted PromQL, we can build an automated promotion gate that uses system health as its primary signal.
Key takeaways for production systems:
* Never trust sleep: Always use metric-driven analysis for canary promotions.
* Analyze multiple SLIs: A single metric like success rate is insufficient. Combine it with latency, saturation, or other business-critical indicators.
* Plan for edge cases: Explicitly check for minimum traffic to avoid inconclusive results and ensure your canaries are actually being tested.
* Prefer baseline comparison: Comparing the canary against the stable version is the most robust strategy, as it makes your deployment pipeline adaptive to real-time system conditions.
Implementing these patterns moves your CI/CD process from a simple deployment mechanism to an intelligent, self-healing system that actively protects your users and your SLOs. It is a foundational element for any organization aiming to ship features quickly and safely at scale.