Argo Rollouts AnalysisTemplates for Automated Canary Analysis
The Inadequacy of Time-Based Canary Deployments
In mature Kubernetes environments, the standard Deployment object with a RollingUpdate strategy often falls short. It primarily ensures pod availability (readinessProbe) but offers no insight into the quality of the new version. Is it introducing subtle bugs? Has latency increased? Is the error rate creeping up? These are questions that simple liveness and readiness probes cannot answer.
Canary deployments are the accepted solution, but naive implementations often rely on manual observation or, worse, time.sleep intervals. A senior engineer knows this is untenable. A 10-minute wait before promotion doesn't guarantee stability; it only guarantees a 10-minute delay. True confidence in a release comes from data-driven decisions based on the real-time performance of the canary instance under production traffic.
This is where Argo Rollouts, a progressive delivery controller for Kubernetes, fundamentally changes the game. While its traffic-shaping capabilities (integrating with service meshes like Istio or ingress controllers like NGINX) are well-known, its most powerful feature for production hardening is the AnalysisRun and AnalysisTemplate system. This system allows you to define declarative, metric-driven quality gates that automatically promote or roll back a canary based on data from monitoring systems like Prometheus, Datadog, or even custom webhooks.
This post will bypass the basics of setting up Argo Rollouts and focus exclusively on designing and implementing robust AnalysisTemplates for automated, production-grade canary analysis.
Core Architecture: Rollouts, AnalysisRuns, and Templates
To understand the implementation, we must first be precise about the components involved:
* Rollout: A Custom Resource Definition (CRD) provided by Argo Rollouts that replaces the standard Deployment object. It defines the application spec (just like a Deployment) but adds a highly configurable strategy section for canaries or blue-green deployments.
* AnalysisTemplate: A reusable, parameterized template that defines what to measure and what constitutes success. It specifies the metrics to query (e.g., PromQL queries), the expected results (successCondition), and how often to measure.
* AnalysisRun: An instantiation of an AnalysisTemplate created by the Rollout controller at a specific step in the deployment process. It executes the metric queries and reports a Successful, Failed, or Inconclusive status back to the Rollout.
The workflow is as follows:
1. You update the image in the Rollout manifest and apply it (kubectl apply -f rollout.yaml).
2. The Rollout controller creates a new ReplicaSet for the canary version.
3. The controller begins executing the steps defined in the canary strategy. For example, it might first shift 5% of traffic to the canary.
4. When it reaches an analysis step, it pauses the rollout.
5. The controller creates an AnalysisRun resource from the specified AnalysisTemplate, injecting any arguments.
6. The AnalysisRun controller takes over, querying the metric provider (e.g., Prometheus) at the defined interval.
7. Each measurement is evaluated against the successCondition.
8. If the AnalysisRun is successful, the Rollout controller unpauses and proceeds to the next step (e.g., increasing traffic to 25% or full promotion).
9. If the AnalysisRun fails, the Rollout controller immediately aborts the update and rolls back to the previous stable version.
This declarative, automated feedback loop is the foundation of a safe and reliable continuous delivery pipeline, and manual intervention remains possible at any point, as shown below.
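For reference, the Argo Rollouts kubectl plugin gives you manual control over this loop at any time; a few of the commands used throughout this post (the my-app name and production namespace anticipate the example later in the article):

# Watch a rollout progress through its steps and any AnalysisRuns
kubectl argo rollouts get rollout my-app -n production --watch

# Manually resume a paused rollout (promote it past the current pause)
kubectl argo rollouts promote my-app -n production

# Manually abort an in-progress update and return to the stable version
kubectl argo rollouts abort my-app -n production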
Production Implementation: Error Rate Analysis with Prometheus
Let's build a practical, production-ready scenario. We have a microservice that exposes an HTTP endpoint. We want to ensure that any new version does not increase the HTTP 5xx error rate beyond a 1% threshold.
Prerequisites:
* A Kubernetes cluster with Argo Rollouts controller installed.
* Prometheus installed and scraping metrics from your application pods. Your application must expose a metric like http_requests_total with status_code and path labels (a minimal instrumentation sketch follows).
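The exact shape of this metric matters, because every query in this post assumes it. Here is a minimal sketch of exposing http_requests_total with status_code and path labels using prometheus/client_golang; the /orders handler and port are illustrative, most frameworks provide middleware for this, and the app and namespace labels used in the queries are attached by Prometheus's Kubernetes service discovery rather than by the application:

package main

import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequests is the counter the PromQL in this post queries:
// http_requests_total{status_code, path}.
var httpRequests = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests by status code and path.",
    },
    []string{"status_code", "path"},
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// instrument increments the counter once per request with the final status code.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        httpRequests.WithLabelValues(strconv.Itoa(rec.status), path).Inc()
    }
}

func main() {
    http.HandleFunc("/orders", instrument("/orders", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.Handle("/metrics", promhttp.Handler()) // endpoint scraped by Prometheus
    http.ListenAndServe(":8080", nil)
}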
Step 1: Defining the `AnalysisTemplate`
First, we create the reusable logic for our error rate check. We'll define an AnalysisTemplate that takes the service name and namespace as arguments to make it generic.
# analysis-template-error-rate.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-error-rate-check
spec:
  args:
  - name: service-name
  - name: namespace
  metrics:
  - name: error-rate
    # We'll tolerate one failed measurement to account for transient network issues or metric scrape lag.
    failureLimit: 1
    # Run the query every 20 seconds.
    interval: 20s
    # Run the analysis for a total of 5 measurements (100 seconds total).
    # This gives us a meaningful sample and rides out initial fluctuations.
    count: 5
    provider:
      prometheus:
        address: http://prometheus-kube-prometheus-stack-prometheus.monitoring:9090
        # This PromQL query calculates the ratio of 5xx responses to total responses over the last minute.
        # We use `{{args.service-name}}` to dynamically insert the service name from the Rollout.
        query: |
          sum(rate(http_requests_total{app="{{args.service-name}}", namespace="{{args.namespace}}", status_code=~"5.."}[1m]))
          /
          sum(rate(http_requests_total{app="{{args.service-name}}", namespace="{{args.namespace}}"}[1m]))
    # The success condition is the core of our quality gate.
    # `result[0]` refers to the first (and only) value returned by the PromQL query.
    # We guard two no-traffic edge cases: an empty result vector (no samples) and NaN from the 0/0 division.
    # If there's no traffic, we consider it a success to avoid blocking deployments in low-traffic environments.
    # The primary condition is that the error rate must be less than or equal to 0.01 (1%).
    successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 0.01
Key Implementation Details:
* failureLimit: 1: This is crucial for production. Metric systems can have transient failures. This setting allows one measurement to fail without failing the entire AnalysisRun. The run will only fail if it exceeds this limit.
* count: 5, interval: 20s: The duration of the analysis (count × interval) must be long enough to be meaningful. A 10-second analysis is useless. 100 seconds gives us time to collect data and smooth out brief spikes. This duration should also be longer than your Prometheus scrape interval to ensure fresh data is being evaluated.
* len(result) == 0 || isNaN(result[0]): These handle the no-traffic edge cases. If a canary receives zero traffic during the analysis window, the PromQL query either returns no samples at all or evaluates 0/0 to NaN. Without these checks, the AnalysisRun would fail. By treating both cases as a success, we prevent deployments from being blocked due to a lack of traffic.
* Parameterization with args: Using {{args.service-name}} makes this template reusable across dozens of microservices, enforcing a consistent quality gate.
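Before codifying a query like this, it helps to evaluate it by hand against the Prometheus HTTP API so you know what result[0] will actually contain; a quick sketch, assuming the same in-cluster Prometheus Service used above and concrete label values:

# Temporarily expose Prometheus locally
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-stack-prometheus 9090:9090

# In another shell, evaluate the exact expression the AnalysisRun will run
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{app="my-app", namespace="production", status_code=~"5.."}[1m])) / sum(rate(http_requests_total{app="my-app", namespace="production"}[1m]))'

An empty result vector or a NaN value here is exactly what the len(result) == 0 and isNaN guards in the successCondition will see during a low-traffic analysis.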
Step 2: Integrating the Template into a `Rollout`
Now, let's define a Rollout for our application that uses this AnalysisTemplate.
# my-app-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        # The initial image
        image: my-registry/my-app:1.0.0
        ports:
        - containerPort: 8080
  strategy:
    canary:
      # These steps define the progressive delivery process.
      steps:
      # 1. Start by sending 10% of traffic to the canary.
      - setWeight: 10
      # 2. Pause briefly so the canary pods can start, warm up, and begin receiving traffic.
      #    (A pause with no duration would halt the rollout until it is manually promoted.)
      - pause: { duration: 1m }
      # 3. After the warm-up pause, run our analysis.
      - analysis:
          templates:
          - templateName: http-error-rate-check
          args:
          - name: service-name
            value: my-app
          - name: namespace
            value: production
      # 4. If analysis succeeds, ramp up traffic to 50%.
      - setWeight: 50
      # 5. Bake in under higher load for five minutes.
      - pause: { duration: 5m }
      # 6. After the last step, the controller promotes the canary to 100% and scales down the old ReplicaSet.
Dissecting the Strategy:
* setWeight: 10: We start with a small blast radius. Only 10% of users are exposed to the new version.
* pause: { duration: 1m }: This is a critical step. We pause before the analysis so the new canary pods can start, warm up, and begin serving traffic. Without this, the analysis might start before any meaningful traffic has reached the canary. Note that a pause without a duration halts the rollout until it is manually promoted, which would defeat the goal of full automation.
* analysis: This block is the core of the integration. It references our http-error-rate-check template by name and provides the necessary arguments. The Rollout will not proceed past this step until the corresponding AnalysisRun completes successfully.
* Implicit Promotion/Rollback: Notice there's no if/else block. The Rollout controller's logic is implicit: if any step fails (including the analysis step), the entire update is considered failed, and an automatic, immediate rollback to the last stable ReplicaSet is initiated. If all steps succeed, the new ReplicaSet is marked as stable.
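As an alternative to gating on a single step, Argo Rollouts also supports background analysis declared at the canary level: the AnalysisRun starts once (optionally delayed via startingStep) and keeps running for the entire update, aborting it the moment a metric fails. A sketch reusing our template; for background use you would typically drop count from the template so it keeps measuring until the rollout finishes:

  strategy:
    canary:
      analysis:
        templates:
        - templateName: http-error-rate-check
        startingStep: 2   # begin measuring only after setWeight 10 and the warm-up pause
        args:
        - name: service-name
          value: my-app
        - name: namespace
          value: production
      steps:
      - setWeight: 10
      - pause: { duration: 1m }
      - setWeight: 50
      - pause: { duration: 5m }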
Step 3: Observing the Deployment in Action
When you deploy a new image (my-registry/my-app:1.1.0), you can observe the process:
kubectl argo rollouts get rollout my-app -n production -w
You will see the status change from Healthy to Progressing as the steps execute in order, eventually pausing at the analysis step while the AnalysisRun completes.
The rollout controller creates an AnalysisRun object. You can inspect it to see the live results.
# Find the generated AnalysisRun
kubectl get analysisrun -n production
# Describe it to see measurements
kubectl describe analysisrun <generated-analysis-run-name> -n production
The description will show each metric measurement, its value, and whether it was successful or failed.
Simulating a Failure:
Imagine version 1.1.0 has a bug causing it to return a 5% error rate. The AnalysisRun will proceed as follows:
* Measurement 1: result: [0.05]. The measurement fails (0.05 > 0.01), but the AnalysisRun phase is still Running because failureLimit: 1 tolerates a single failure.
* Measurement 2: result: [0.05]. A second failed measurement exceeds the failureLimit.
Since 0.05 is not <= 0.01, the measurements keep failing, and as soon as the failureLimit is exceeded the AnalysisRun is marked as Failed. The Rollout controller sees this, immediately aborts the update, and scales the new ReplicaSet down to zero, shifting all traffic back to the stable 1.0.0 version. You have successfully prevented a production incident without any human intervention.
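In the aftermath of an abort, the Rollout reports a Degraded status while continuing to serve the stable version. A couple of follow-up commands, sketched against the same example (the 1.1.1 tag is a hypothetical fixed image):

# Confirm traffic is back on stable and see which step or analysis caused the abort
kubectl argo rollouts get rollout my-app -n production

# Once a fix is built, push a new image to start a fresh canary attempt
kubectl argo rollouts set image my-app my-app=my-registry/my-app:1.1.1 -n production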
Advanced Patterns and Edge Case Handling
Real-world systems are more complex than a single error rate metric.
Pattern 1: Multi-Metric Analysis for Latency and Errors
You should always analyze multiple key indicators (the Four Golden Signals are a good starting point). Let's enhance our template to check both P99 latency and error rate.
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: standard-api-quality-gate
spec:
  args:
  - name: service-name
  - name: namespace
  metrics:
  - name: error-rate
    failureLimit: 1
    interval: 30s
    count: 5
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="{{args.service-name}}", namespace="{{args.namespace}}", status_code=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{app="{{args.service-name}}", namespace="{{args.namespace}}"}[2m]))
    successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 0.01
  - name: p99-latency
    failureLimit: 1
    interval: 30s
    count: 5
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        # Query for the 99th percentile latency over the last 5 minutes.
        query: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="{{args.service-name}}", namespace="{{args.namespace}}"}[5m])) by (le))
    # Success if P99 latency is at or below 300ms.
    successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 0.3
Key Changes:
* ClusterAnalysisTemplate: We've promoted this to a ClusterAnalysisTemplate. This is a non-namespaced resource, allowing any Rollout in any namespace to reference it (note the clusterScope flag in the sketch below). This is a best practice for platform teams wanting to enforce a standard quality gate across the organization.
* Multiple metrics entries: The AnalysisRun will now evaluate both queries concurrently. The entire run is only successful if all metrics meet their respective successCondition for the required number of measurements.
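One wiring detail: when a Rollout references a ClusterAnalysisTemplate from an analysis step, the template reference needs clusterScope: true. A sketch of the step from the earlier Rollout, adjusted for the cluster-scoped gate:

      - analysis:
          templates:
          - templateName: standard-api-quality-gate
            clusterScope: true
          args:
          - name: service-name
            value: my-app
          - name: namespace
            value: production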
Pattern 2: Custom Checks with Webhook Providers
Sometimes, metrics aren't enough. You might need to run an integration test suite, query a third-party service, or perform a complex business logic check.
The web provider is perfect for this.
The AnalysisTemplate:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: e2e-test-runner
spec:
  args:
  - name: canary-host
  metrics:
  - name: e2e-tests
    provider:
      web:
        # This URL points to a service that can run our tests.
        url: http://e2e-test-service.ci/run-tests
        method: POST
        headers:
        - key: Content-Type
          value: application/json
        body: |-
          {
            "targetHost": "{{args.canary-host}}"
          }
        # The test service must return a JSON body with a list of strings in a `status` field.
        # We declare success if the list contains the string "AllTestsPassed".
        jsonPath: "{.status}"
    successCondition: '"AllTestsPassed" in result'
The Webhook Service Implementation (Example in Go):
Your test runner service must be designed to return a specific JSON format that Argo Rollouts can parse with jsonPath.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type TestRequest struct {
    TargetHost string `json:"targetHost"`
}

type TestResponse struct {
    Status []string `json:"status"`
}

func runTestsHandler(w http.ResponseWriter, r *http.Request) {
    var req TestRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    log.Printf("Running tests against: %s", req.TargetHost)

    // --- DUMMY TEST LOGIC ---
    // In a real implementation, you would trigger a test suite (e.g., Playwright, k6)
    // against the req.TargetHost and check the results.
    success := true // Assume success for this example
    // --- END DUMMY LOGIC ---

    resp := TestResponse{}
    if success {
        resp.Status = append(resp.Status, "AllTestsPassed")
    } else {
        resp.Status = append(resp.Status, "TestsFailed")
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}

func main() {
    http.HandleFunc("/run-tests", runTestsHandler)
    fmt.Println("E2E Test Service listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}
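Before wiring the service into an AnalysisRun, you can exercise the same contract the web provider will use (the target host here is illustrative); given the handler above, the response is deterministic:

curl -s -X POST http://localhost:8080/run-tests \
  -H 'Content-Type: application/json' \
  -d '{"targetHost": "my-app-canary.production.svc.cluster.local"}'
# -> {"status":["AllTestsPassed"]}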
This pattern provides infinite flexibility, allowing you to integrate virtually any external system or custom logic into your deployment pipeline as a quality gate.
Edge Case: The Cold Start / Low Traffic Problem
A common failure mode for canary analysis is insufficient traffic. If the canary receives no requests, error rate and latency metrics will be empty or NaN. While our len(result) == 0 and isNaN checks prevent a false failure, they also create a false positive: the canary is promoted without being properly vetted.
Solutions:
* Generate synthetic load. In the Rollout steps, you can add a step that runs a Kubernetes Job to generate synthetic load (using a tool like k6 or hey) against the canary endpoint before the analysis begins, or better, while it runs. Argo Rollouts has no dedicated load-generation step type, but the Job metric provider inside an AnalysisTemplate achieves the same effect; see the sketch after this list. A webhook that triggers an external load test works as well.
* Guard the query against insignificant traffic. Restructure the PromQL so that low traffic produces an empty result rather than a misleading ratio, and adjust the successCondition to handle an empty result as inconclusive or failed rather than successful:
  (
    sum(rate(http_requests_total{..., status_code=~"5.."}[1m]))
    /
    sum(rate(http_requests_total{...}[1m]))
  )
  and
  (sum(rate(http_requests_total{...}[1m])) > 1)
  This query returns an empty result if the total RPS is not greater than 1, preventing analysis on insignificant traffic. You can then treat the empty result as a failure (successCondition: len(result) > 0 && result[0] <= 0.01) so that a no-traffic canary is never promoted, or make it Inconclusive by also defining failureCondition: len(result) > 0 && result[0] > 0.01, in which case the Rollout pauses and waits for a human decision.
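To make the first option concrete, here is a rough sketch (not a drop-in implementation) of an AnalysisTemplate built on the Job metric provider: the metric succeeds when the Kubernetes Job completes, and the Job simply sends steady traffic at the canary. The curlimages/curl image, the my-app-canary Service address, and the one-request-per-second loop are illustrative assumptions; in practice you would more likely run k6 with a proper script here.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: synthetic-load
spec:
  metrics:
  - name: generate-load
    provider:
      job:
        spec:
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: load
                image: curlimages/curl:8.5.0
                command: ["/bin/sh", "-c"]
                args:
                - |
                  # Send roughly one request per second for two minutes at the (assumed) canary Service.
                  i=0
                  while [ "$i" -lt 120 ]; do
                    curl -s -o /dev/null http://my-app-canary.production.svc.cluster.local:8080/ || true
                    i=$((i+1))
                    sleep 1
                  done

Referencing synthetic-load and http-error-rate-check together in a single analysis step merges them into one AnalysisRun, so the load runs concurrently with the metric measurements rather than finishing before they start:

      - analysis:
          templates:
          - templateName: synthetic-load
          - templateName: http-error-rate-check
          args:
          - name: service-name
            value: my-app
          - name: namespace
            value: production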
Conclusion: From Deployments to Progressive Delivery
Implementing AnalysisTemplates in Argo Rollouts is a significant step in maturing a software delivery practice. It moves the team away from the fragile, hope-based model of RollingUpdate towards a robust, data-driven model of progressive delivery. By codifying quality gates as declarative Kubernetes resources, you create a system that is repeatable, auditable, and fully automated.
The patterns discussed here—parameterized templates, multi-metric analysis, webhook integration, and handling low-traffic scenarios—are not theoretical. They are battle-tested strategies used by high-performing engineering organizations to de-risk their production releases. By automating the analysis of canaries, you free up senior engineers from the tedious and error-prone task of manual verification, allowing them to focus on building features while the platform itself guarantees a high bar for stability and performance.