Metric-Driven Progressive Delivery with Flagger, Istio, and Prometheus
The Fragility of Time-Based Canary Deployments
For any senior engineer operating services on Kubernetes, canary deployments are standard practice. The conventional approach—shifting a small percentage of traffic (e.g., 5%) to a new version, waiting for a fixed duration, and then manually promoting if no alerts fire—is a significant improvement over monolithic 'big bang' releases. However, this model is fundamentally passive and fraught with potential blind spots.
It relies on coarse-grained observation. A 5-minute wait might miss a slow memory leak or a latency degradation that only becomes apparent under specific load patterns. It often depends on existing alerting systems that are tuned for catastrophic failure (e.g., >5% error rate), not subtle performance regressions. Promoting a canary becomes a subjective judgment call based on incomplete data, undermining the goal of a truly automated, reliable pipeline.
This is where metric-driven progressive delivery fundamentally changes the game. Instead of a passive waiting period, we actively query a set of key performance indicators (KPIs) from both the stable (primary) and candidate (canary) deployments. The rollout progresses incrementally only if the canary's performance remains within an acceptable delta of the primary's. If any metric—be it request latency, error rate, or a critical business metric—degrades beyond a defined threshold, the rollout is automatically halted and rolled back. This post details the architecture and implementation of such a system using Flagger, Istio, and Prometheus.
Assumed Environment
This article assumes you are operating a Kubernetes cluster with the following components already installed and configured:
* Kubernetes: v1.21+
* Istio: v1.12+ (with sidecar injection enabled for the target namespace)
* Prometheus Operator: Deployed and configured to scrape pods in the target namespace.
* Flagger: Installed in its own namespace.
We will not cover the installation of these components, as we are focused on the advanced integration patterns.
Solution Architecture: A Closed-Loop System
Our system creates a closed-loop feedback mechanism where deployment decisions are driven by real-time application performance.
The flow works as follows:
1. A CI/CD pipeline builds a new container image (app:v2.0.0) and updates a Kubernetes Deployment manifest.
2. Flagger continuously watches the Deployment objects that it is configured to manage.
3. On detecting the change, Flagger creates a new Deployment for the canary version (app-canary:v2.0.0) and scales down the primary's replica count slightly if necessary.
4. Flagger manages traffic shifting through Istio VirtualService and DestinationRule custom resources. Initially, it configures the VirtualService to send 0% of traffic to the canary, but ensures it's reachable for testing.
5. The analysis loop then begins:
* It increases the traffic weight to the canary by a configured stepWeight (e.g., 10%).
* It waits for a configured interval (e.g., 1 minute).
* During this interval, it executes a series of pre-defined Prometheus queries (PromQL) against both the primary and canary workloads.
* It compares the results against thresholds; for example, p99_latency_canary <= p99_latency_primary * 1.1.
* Success: If all metric checks pass, Flagger repeats the loop, increasing the traffic weight until it reaches the maxWeight (e.g., 50%).
* Failure: If any metric check fails more times than the configured threshold, Flagger immediately reverts the VirtualService to send 100% of traffic back to the primary, scales down the canary Deployment to zero, and marks the rollout as failed.
* Promotion: Once the canary has passed every analysis round up to maxWeight, Flagger promotes it. It updates the primary Deployment with the new image version, waits for it to become healthy, and then safely removes the canary Deployment and cleans up the Istio routing rules.
This entire process is automated, data-driven, and significantly reduces the risk of deploying a faulty version.
Deep Dive: The Flagger `Canary` Resource
The core of our implementation is the Flagger Canary Custom Resource Definition (CRD). This resource declaratively defines the entire progressive delivery strategy for a specific application. Let's break down a sophisticated example.
# podinfo-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # Reference to the target deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # The service that fronts the deployment
  service:
    port: 9898
    # Optional: set if the container port differs from the service port
    # targetPort: 9898
  # The core of the progressive delivery strategy
  analysis:
    # Run analysis every 1 minute
    interval: 1m
    # Abort after 10 failed checks
    threshold: 10
    # Max traffic weight to shift to the canary (50%)
    maxWeight: 50
    # Traffic weight increment step (10%)
    stepWeight: 10
    # Metrics to check during analysis
    metrics:
      - name: request-success-rate
        # Minimum success rate (99%)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # Check against a custom MetricTemplate (defined separately)
        templateRef:
          name: latency-p99
          namespace: istio-system
        # p99 latency should not exceed 500ms
        thresholdRange:
          max: 500
        interval: 30s
      - name: 'custom-checkout-success-rate'
        templateRef:
          name: checkout-success-rate
          namespace: test
        # Checkout success rate must not drop below 99.5%
        thresholdRange:
          min: 99.5
        interval: 1m
    # Webhooks for integration with other systems
    webhooks:
      - name: "load-test"
        type: pre-rollout
        url: http://flagger-loadtester.test/ # Trigger a load test
        timeout: 5m
        metadata:
          type: "bash"
          cmd: "curl -s http://podinfo-canary.test:9898/healthz"
      - name: "acceptance-test"
        type: rollout
        url: http://gatekeeper.test/approve
        # This webhook is called at each step of the analysis
        # and can be used to run integration tests
      - name: "slack-notification"
        type: event # fires on every canary lifecycle event
        url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
        metadata:
          channel: "#releases"
          user: "Flagger"
Key `analysis` Fields Explained:
* interval: The duration of each analysis step. A shorter interval provides faster feedback but can be susceptible to noise. A longer interval smooths out metrics but slows down the release. 1m is a reasonable starting point.
* threshold: The number of consecutive failed metric checks before triggering a rollback. This prevents a single anomalous data point from failing the entire deployment. A value between 5 and 10 is common.
* stepWeight / maxWeight: These control the traffic ramp-up. stepWeight: 10 and maxWeight: 50 means the analysis will run at 10%, 20%, 30%, 40%, and 50% traffic. This provides multiple checkpoints to catch issues before they impact a majority of users.
* metrics: This is where we define our KPIs. Notice the mix of a standard metric (request-success-rate) and two more advanced metrics defined by templateRef.
* webhooks: This enables powerful integrations. The pre-rollout webhook can trigger a synthetic load test against the canary before any real user traffic is shifted. The event webhook provides real-time notifications about the deployment's progress.
Crafting Custom Metric Templates for Deep Analysis
Flagger's real power is unlocked when you move beyond default metrics and define your own using MetricTemplate resources. These templates contain the raw PromQL queries that extract the precise signals you care about.
Scenario 1: Analyzing p99 Latency
Average latency can be misleading: it hides significant tail-latency issues where a small percentage of users experience extreme slowdowns. Analyzing the 99th percentile (p99) latency gives a much clearer picture of the worst-case user experience.
Istio's default Prometheus metrics include a histogram called istio_request_duration_milliseconds_bucket. We can use this to calculate percentiles.
Here is the MetricTemplate to calculate p99 latency:
# latency-metric-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency-p99
  namespace: istio-system # Best to keep Istio-related templates here
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system.svc.cluster.local:9090
  query: |-
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{
        reporter="destination",
        destination_workload_namespace="{{ namespace }}",
        destination_workload=~"{{ target }}.*",
        destination_app="{{ target }}"
      }[{{ interval }}])) by (le)
    )
Dissecting the PromQL Query:
* histogram_quantile(0.99, ...): The core Prometheus function for calculating percentiles from a histogram.
* sum(rate(...)) by (le): This is the standard pattern for working with histograms. We calculate the rate of observations over our specified interval and sum them up, preserving the histogram bucket (le) label.
* destination_workload=~"{{ target }}.*": This is crucial. Flagger substitutes {{ target }} with the target deployment name (podinfo), and the .* regex suffix ensures the query matches both the primary and canary workloads.
* {{ namespace }} and {{ interval }}: These are variables that Flagger will substitute at runtime with values from the Canary resource, making the template reusable.
When this template is referenced in the Canary analysis, Flagger runs the query at each interval and compares the result against the thresholdRange. In our example, if the observed p99 latency exceeds 500ms while the canary is taking traffic, the check fails.
Scenario 2: Analyzing a Business-Specific Metric
Technical metrics are essential, but what if a new deployment subtly breaks a critical business workflow? For example, a change might not generate HTTP 500 errors but could cause the checkout process in an e-commerce application to fail silently.
To guard against this, we must instrument our application to expose custom business metrics.
Application Instrumentation (Example in Go with Prometheus client):
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    checkoutEvents = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_checkout_events_total",
            Help: "Total number of checkout events by status.",
        },
        []string{"status"}, // 'success' or 'failure'
    )
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // ... complex checkout logic ...
    success := performCheckoutLogic()
    if success {
        checkoutEvents.WithLabelValues("success").Inc()
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Checkout successful!"))
    } else {
        checkoutEvents.WithLabelValues("failure").Inc()
        // Still return a 200 OK because the system itself didn't error,
        // but the business logic failed. This is a key scenario.
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Checkout failed. Please try again."))
    }
}
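For these counters to reach Prometheus, the pods' /metrics endpoint has to be scraped. The assumed environment already has the Prometheus Operator scraping the target namespace, but if your setup isn't discovering the podinfo pods yet, a PodMonitor along the lines of the sketch below would close that gap; the app label and port name are assumptions about the podinfo manifests, so adjust them to match yours.
# podinfo-podmonitor.yaml (only needed if the pods are not already scraped)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: podinfo
  namespace: test
spec:
  selector:
    matchLabels:
      app: podinfo # assumed pod label
  podMetricsEndpoints:
    - port: http # assumed name of the container port serving /metrics
      path: /metrics
      interval: 15s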
Now, our application exposes a metric app_checkout_events_total. We can write a MetricTemplate to calculate the success rate.
# checkout-metric-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-success-rate
  namespace: test # Keep application-specific templates in the app's namespace
spec:
  provider:
    type: prometheus
    address: http://prometheus-operated.monitoring.svc.cluster.local:9090
  query: |-
    (
      sum(rate(app_checkout_events_total{
        namespace="{{ namespace }}",
        pod=~"^{{ target }}.*$",
        status="success"
      }[{{ interval }}]))
      /
      sum(rate(app_checkout_events_total{
        namespace="{{ namespace }}",
        pod=~"^{{ target }}.*$"
      }[{{ interval }}]))
    ) * 100
Query Dissection:
This query calculates the ratio of successful checkouts to total checkouts and multiplies by 100 to get a percentage. The key is pod=~"^{{ target }}.*$" which correctly selects the pods belonging to either the primary or canary deployment based on their naming convention.
By referencing this template in our Canary object, we've now tied our deployment's safety directly to a critical business outcome. The thresholdRange: { min: 99.5 } in the Canary spec means that the measured checkout success rate must stay at or above 99.5% throughout the analysis. Set that floor just below the primary's normal success rate and the check effectively becomes a baseline guard: the point isn't hitting an arbitrary absolute number, it's ensuring the canary is no worse than what users already get.
Implementation Walkthrough: A Failed Deployment
Let's simulate a scenario where we deploy a new version of our podinfo service that introduces a 600ms latency delay.
1. Initial State:
* Deployment podinfo is running app:v1.0.0.
* Service podinfo routes to this deployment.
* Istio VirtualService podinfo has a single route to the podinfo service subset.
* Our Canary podinfo and MetricTemplates are applied.
2. Triggering the Deployment:
We update the podinfo Deployment's container image to app:v2.0.0 (our faulty version).
kubectl set image deployment/podinfo podinfo=ghcr.io/stefanprodan/podinfo:6.0.1 (assuming 6.0.1 is the bad version)
3. Flagger Takes Control:
Flagger detects the change. Looking at the Flagger logs (kubectl logs -f deployment/flagger -n flagger), we'll see the process unfold:
INFO podinfo.test - New revision detected! Scaling up podinfo.test
INFO podinfo.test - Starting canary analysis for podinfo.test
INFO podinfo.test - Pre-rollout check acceptance-test passed
INFO podinfo.test - Advance podinfo.test canary weight 10
Flagger has created podinfo-canary and updated the VirtualService to send 10% of traffic to it.
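To make the traffic split concrete, the VirtualService that Flagger manages now looks roughly like the fragment below. This is a simplified sketch using the article's service names; the resource Flagger actually generates also carries host and gateway settings derived from the Canary service spec.
# Simplified view of the Flagger-managed VirtualService at 10% canary weight
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  hosts:
    - podinfo
  http:
    - route:
        - destination:
            host: podinfo # primary
          weight: 90
        - destination:
            host: podinfo-canary # canary
          weight: 10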
4. Analysis Begins:
After the first interval (1 minute), Flagger runs its metric checks.
INFO podinfo.test - Running analysis for podinfo.test
INFO podinfo.test - request-success-rate: 100.00 >= 99
WARN podinfo.test - request-duration: 615.45 > 500
WARN podinfo.test - Canary podinfo.test failed checks: request-duration
The request-duration check, using our p99 latency template, has failed. The measured latency (615ms) is above our thresholdRange.max of 500ms.
5. Iterative Failures and Rollback:
Flagger will continue the analysis. Since the high latency is persistent, the check will fail repeatedly. The threshold is set to 10. After 10 consecutive failures:
... (9 more failure logs)
ERROR podinfo.test - Canary failed! Scaling down podinfo.test
INFO podinfo.test - Rolling back podinfo.test failed checks threshold reached 10
INFO podinfo.test - Halting podinfo.test advancement
INFO podinfo.test - Promotion failed!
Flagger immediately performs the rollback:
* It reconfigures the Istio VirtualService to send 100% of traffic back to the podinfo primary Deployment.
* It scales the podinfo-canary Deployment down to zero replicas.
The entire incident was contained. Only a small percentage of users were exposed to the degraded performance, and the system automatically healed itself without any human intervention.
Advanced Patterns and Production Edge Cases
While the core loop is powerful, real-world systems present more complexity.
Edge Case 1: Coordinating with Database Migrations
A common challenge is deploying an application change that depends on a database schema migration. A naive canary deployment can be catastrophic: the new v2 code expects a new column that the schema may not have yet, while 90% of traffic is still served by v1 code, which in turn will break if existing columns are removed or altered underneath it.
Solution: Multi-Stage Release with Feature Flags and Flagger Webhooks
This requires an expand/contract pattern, orchestrated by your CI/CD system and Flagger:
1. Phase 1: Deploy the v2 code, feature-flagged off. Deploy the new application code (v2) but have it behave identically to v1 via a feature flag; the new code paths are present but disabled. Run a full Flagger progressive delivery. The goal is to verify that the new binary is stable with the old schema.
2. Phase 2: Run the expansive schema migration (e.g., add the new column). The v1 code must ignore it and continue to function.
3. Phase 3: Deploy the v3 Deployment. Deploy a new version (v3, which is identical to v2 but with the feature flag enabled by default via configuration). Trigger another Flagger progressive delivery. During this phase, Flagger will be analyzing the performance of the new code paths interacting with the new schema.
4. Phase 4: Contract. Once v3 is fully rolled out and stable, you can deploy a v4 that removes the old code paths and the feature flag logic. In a separate maintenance window, run a contractive migration to remove the old database columns.
Flagger's webhooks can be used to gate these phases. For example, a post-rollout webhook from Phase 1 could trigger the CI job for Phase 2, as sketched below.
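As a minimal sketch of that gating, the Phase 1 Canary could carry a post-rollout webhook that notifies the CI system once the rollout has fully promoted; the URL here is a hypothetical CI trigger endpoint, not something defined elsewhere in this article.
# Fragment of the Phase 1 Canary spec (illustrative)
analysis:
  webhooks:
    - name: "trigger-expand-migration"
      type: post-rollout
      # Hypothetical CI endpoint that starts the Phase 2 schema migration job
      url: http://ci-trigger.ci.svc.cluster.local/jobs/expand-schema
      timeout: 30s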
Edge Case 2: A/B Testing vs. Canary Analysis
Sometimes you want to split traffic for A/B testing, not for safety analysis. For example, you want to send all users from Canada to a new version to test a feature.
Solution: Header-Based Routing with Istio
Flagger supports this directly. You can add header match conditions (and, optionally, traffic mirroring) to the analysis section of your Canary spec.
# A/B testing spec fragment
spec:
  analysis:
    # ... standard canary analysis (interval, threshold, metrics) ...
    # A/B test routing: run the analysis a fixed number of times
    iterations: 10
    match:
      - headers:
          x-user-geo:
            exact: "canada"
    # Optional: mirror live traffic to the canary for a dark launch
    # mirror: true
    # mirrorWeight: 100
In this setup, Flagger configures the VirtualService to route any request carrying the header x-user-geo: canada to the canary, while all other traffic continues to hit the primary. Because iterations is set, Flagger runs the analysis a fixed number of times against that matched traffic rather than ramping weights with stepWeight. The mirror option can additionally send a copy of live traffic to the canary for dark launching. This allows you to perform targeted A/B tests within the same safe, progressive delivery framework.
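Under the hood, the routing rule Flagger programs for this resembles the VirtualService fragment below; it is illustrative only, since Flagger generates and owns the real resource.
# Conceptual Istio routing for the A/B test
http:
  - match:
      - headers:
          x-user-geo:
            exact: "canada"
    route:
      - destination:
          host: podinfo-canary
  - route:
      - destination:
          host: podinfo # everyone else stays on the primary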
Edge Case 3: Handling Noisy or Low-Volume Metrics
For a low-traffic service, a single failed request can drop the success rate from 100% to 50%, causing a false-positive rollback. Similarly, latency can be spiky.
Solutions:
* Lengthen analysis.interval: a longer interval (e.g., 5m) provides a larger sample size, smoothing out the data.
* Adjust the MetricTemplate to use functions like avg_over_time() to smooth out spiky metrics.
* Raise analysis.threshold: a higher threshold requires the metric to fail consistently for a longer period before rollback, making the system less sensitive to transient blips.
* Use a pre-rollout webhook to trigger a load test. This guarantees a minimum level of traffic to the canary, ensuring you have enough data for a statistically significant analysis. A combined tuning sketch follows this list.
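Putting a few of these together, a tuned analysis block for a low-traffic service might look like the fragment below. This is a sketch, not a prescription: it assumes the flagger-loadtester service from the earlier webhook example is available and that its image ships the hey load generator, and it uses a rollout-type webhook (rather than pre-rollout) so synthetic traffic flows during every analysis interval.
# Illustrative analysis tuning for a low-traffic service
analysis:
  # Larger window per check = bigger, smoother sample
  interval: 5m
  # Tolerate several transient blips before rolling back
  threshold: 10
  stepWeight: 20
  maxWeight: 60
  metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 5m
  webhooks:
    # Generate synthetic traffic throughout the analysis so the
    # metrics have a statistically meaningful sample size
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 5m -q 10 -c 2 http://podinfo-canary.test:9898/"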
Performance and Resource Considerations
This powerful automation does not come for free.
* Istio Sidecar Overhead: The istio-proxy sidecar injected into each of your application pods consumes CPU and memory. This is typically in the range of 0.1-0.5 vCPU and 50-100MB RAM per pod under load. The impact on latency is usually low (a few milliseconds) but should be measured for your specific workload.
* Prometheus Footprint: A production-grade Prometheus setup can be resource-intensive, particularly in terms of memory and disk I/O, as it scales with metric cardinality and scrape frequency.
* Flagger Controller: Flagger itself is lightweight, typically consuming minimal resources.
For most modern applications, the safety and automation benefits far outweigh the resource cost. The key is to properly capacity-plan your cluster and monitor the monitoring stack itself.
Conclusion: From Release Engineering to Release Science
Implementing a progressive delivery system with Flagger, Istio, and Prometheus is a significant step in maturing a DevOps practice. It shifts the release process from a manual, anxiety-ridden engineering task to an automated, data-driven scientific process. By defining success criteria as a set of observable, quantifiable metrics, you create a resilient, self-healing deployment pipeline that catches not only catastrophic failures but also subtle performance regressions before they impact the majority of your users.
This approach requires a deeper investment in observability and a more declarative mindset, but the payoff is faster, safer, and more frequent releases—a cornerstone of high-performing engineering organizations.