Metric-Driven Progressive Delivery with Flagger, Istio, and Prometheus
The Fragility of Time-Based Canary Deployments
For any senior engineer operating services on Kubernetes, canary deployments are standard practice. The conventional approach—shifting a small percentage of traffic (e.g., 5%) to a new version, waiting for a fixed duration, and then manually promoting if no alerts fire—is a significant improvement over monolithic 'big bang' releases. However, this model is fundamentally passive and fraught with potential blind spots.
It relies on coarse-grained observation. A 5-minute wait might miss a slow memory leak or a latency degradation that only becomes apparent under specific load patterns. It often depends on existing alerting systems that are tuned for catastrophic failure (e.g., >5% error rate), not subtle performance regressions. Promoting a canary becomes a subjective judgment call based on incomplete data, undermining the goal of a truly automated, reliable pipeline.
This is where metric-driven progressive delivery fundamentally changes the game. Instead of a passive waiting period, we actively query a set of key performance indicators (KPIs) from both the stable (primary) and candidate (canary) deployments. The rollout progresses incrementally only if the canary's performance remains within an acceptable delta of the primary's. If any metric—be it request latency, error rate, or a critical business metric—degrades beyond a defined threshold, the rollout is automatically halted and rolled back. This post details the architecture and implementation of such a system using Flagger, Istio, and Prometheus.
Assumed Environment
This article assumes you are operating a Kubernetes cluster with the following components already installed and configured:
* Kubernetes: v1.21+
* Istio: v1.12+ (with sidecar injection enabled for the target namespace)
* Prometheus Operator: Deployed and configured to scrape pods in the target namespace.
* Flagger: Installed in its own namespace.
We will not cover the installation of these components, as we are focused on the advanced integration patterns.
Solution Architecture: A Closed-Loop System
Our system creates a closed-loop feedback mechanism where deployment decisions are driven by real-time application performance.
The flow works as follows:
1. A CI/CD pipeline builds a new container image (app:v2.0.0) and updates a Kubernetes Deployment manifest.
2. Flagger continuously watches the Deployment objects that it is configured to manage.
3. On detecting the change, Flagger creates a new Deployment for the canary version (app-canary:v2.0.0) and scales down the primary's replica count slightly if necessary.
4. Flagger manages traffic shifting through Istio VirtualService and DestinationRule custom resources. Initially, it configures the VirtualService to send 0% of traffic to the canary, but ensures it's reachable for testing.
5. The analysis loop then begins:
* It increases the traffic weight to the canary by a configured stepWeight (e.g., 10%).
* It waits for a configured interval (e.g., 1 minute).
* During this interval, it executes a series of pre-defined Prometheus queries (PromQL) against both the primary and canary workloads.
* It compares the results against thresholds; for example, p99_latency_canary <= p99_latency_primary * 1.1.
* Success: If all metric checks pass, Flagger repeats the loop, increasing the traffic weight until it reaches the maxWeight (e.g., 50%).
* Failure: If any metric check fails more times than the configured threshold, Flagger immediately reverts the VirtualService to send 100% of traffic back to the primary, scales down the canary Deployment to zero, and marks the rollout as failed.
* Promotion: Once the canary has passed every analysis round up to maxWeight, Flagger promotes it. It updates the primary Deployment with the new image version, waits for it to become healthy, and then safely removes the canary Deployment and cleans up the Istio routing rules.
This entire process is automated, data-driven, and significantly reduces the risk of deploying a faulty version.
Deep Dive: The Flagger `Canary` Resource
The core of our implementation is the Flagger Canary Custom Resource Definition (CRD). This resource declaratively defines the entire progressive delivery strategy for a specific application. Let's break down a sophisticated example.
# podinfo-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # Reference to the target deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # The service that fronts the deployment
  service:
    port: 9898
    # Optional: set if the container port differs from the service port
    # targetPort: 9898
  # The core of the progressive delivery strategy
  analysis:
    # Run analysis every 1 minute
    interval: 1m
    # Abort after 10 failed checks
    threshold: 10
    # Max traffic weight to shift to the canary (50%)
    maxWeight: 50
    # Traffic weight increment step (10%)
    stepWeight: 10
    # Metrics to check during analysis
    metrics:
      - name: request-success-rate
        # Minimum success rate (99%)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # Check against a custom MetricTemplate (defined separately)
        templateRef:
          name: latency-p99
          namespace: istio-system
        # p99 latency should not exceed 500ms
        thresholdRange:
          max: 500
        interval: 30s
      - name: 'custom-checkout-success-rate'
        templateRef:
          name: checkout-success-rate
          namespace: test
        # Checkout success rate must not drop below 99.5%
        thresholdRange:
          min: 99.5
        interval: 1m
    # Webhooks for integration with other systems
    webhooks:
      - name: "load-test"
        type: pre-rollout
        url: http://flagger-loadtester.test/ # Trigger a load test
        timeout: 5m
        metadata:
          type: "bash"
          cmd: "curl -s http://podinfo-canary.test:9898/healthz"
      - name: "acceptance-test"
        type: rollout
        url: http://gatekeeper.test/approve
        # This webhook is called at each step of the analysis
        # and can be used to run integration tests
      - name: "slack-notification"
        type: event # fires on every canary lifecycle event
        url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
        metadata:
          channel: "#releases"
          user: "Flagger"
Key `analysis` Fields Explained:
* interval: The duration of each analysis step. A shorter interval provides faster feedback but can be susceptible to noise. A longer interval smooths out metrics but slows down the release. 1m is a reasonable starting point.
* threshold: The number of consecutive failed metric checks before triggering a rollback. This prevents a single anomalous data point from failing the entire deployment. A value between 5 and 10 is common.
* stepWeight / maxWeight: These control the traffic ramp-up. stepWeight: 10 and maxWeight: 50 means the analysis will run at 10%, 20%, 30%, 40%, and 50% traffic. This provides multiple checkpoints to catch issues before they impact a majority of users.
* metrics: This is where we define our KPIs. Notice the mix of a standard metric (request-success-rate) and two more advanced metrics defined by templateRef.
* webhooks: This enables powerful integrations. The pre-rollout webhook can trigger a synthetic load test against the canary before any real user traffic is shifted. The event webhook provides real-time notifications about the deployment's progress.
Crafting Custom Metric Templates for Deep Analysis
Flagger's real power is unlocked when you move beyond default metrics and define your own using MetricTemplate resources. These templates contain the raw PromQL queries that extract the precise signals you care about.
Scenario 1: Analyzing p99 Latency
Average latency can be misleading: it hides significant tail-latency issues where a small percentage of users experience extreme slowdowns. Analyzing the 99th percentile (p99) latency gives a much clearer picture of the worst-case user experience.
Istio's default Prometheus metrics include a histogram called istio_request_duration_milliseconds_bucket. We can use this to calculate percentiles.
Here is the MetricTemplate to calculate p99 latency:
# latency-metric-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency-p99
  namespace: istio-system # Best to keep Istio-related templates here
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system.svc.cluster.local:9090
  query: |-
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{
        reporter="destination",
        destination_workload_namespace="{{ namespace }}",
        destination_workload=~"{{ target }}.*",
        destination_app="{{ target }}"
      }[{{ interval }}])) by (le)
    )
Dissecting the PromQL Query:
* histogram_quantile(0.99, ...): The core Prometheus function for calculating percentiles from a histogram.
* sum(rate(...)) by (le): This is the standard pattern for working with histograms. We calculate the rate of observations over our specified interval and sum them up, preserving the histogram bucket (le) label.
* destination_workload=~"{{ target }}.*": This is crucial. Flagger substitutes {{ target }} with the target deployment name (podinfo), and the .* regex suffix ensures the query matches both the primary and canary workloads.
* {{ namespace }} and {{ interval }}: These are variables that Flagger will substitute at runtime with values from the Canary resource, making the template reusable.
When this template is referenced in the Canary analysis, Flagger runs the query at each interval and compares the result against the thresholdRange. In our example, if the observed p99 latency exceeds 500ms while the canary is taking traffic, the check fails.
Scenario 2: Analyzing a Business-Specific Metric
Technical metrics are essential, but what if a new deployment subtly breaks a critical business workflow? For example, a change might not generate HTTP 500 errors but could cause the checkout process in an e-commerce application to fail silently.
To guard against this, we must instrument our application to expose custom business metrics.
Application Instrumentation (Example in Go with Prometheus client):
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    checkoutEvents = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_checkout_events_total",
            Help: "Total number of checkout events by status.",
        },
        []string{"status"}, // 'success' or 'failure'
    )
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // ... complex checkout logic ...
    success := performCheckoutLogic()
    if success {
        checkoutEvents.WithLabelValues("success").Inc()
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Checkout successful!"))
    } else {
        checkoutEvents.WithLabelValues("failure").Inc()
        // Still return a 200 OK because the system itself didn't error,
        // but the business logic failed. This is a key scenario.
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Checkout failed. Please try again."))
    }
}
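For these counters to reach Prometheus, the pods' /metrics endpoint has to be scraped. The assumed environment already has the Prometheus Operator scraping the target namespace, but if your setup isn't discovering the podinfo pods yet, a PodMonitor along the lines of the sketch below would close that gap; the app label and port name are assumptions about the podinfo manifests, so adjust them to match yours.
# podinfo-podmonitor.yaml (only needed if the pods are not already scraped)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: podinfo
  namespace: test
spec:
  selector:
    matchLabels:
      app: podinfo # assumed pod label
  podMetricsEndpoints:
    - port: http # assumed name of the container port serving /metrics
      path: /metrics
      interval: 15s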
Now, our application exposes a metric app_checkout_events_total. We can write a MetricTemplate to calculate the success rate.
# checkout-metric-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-success-rate
  namespace: test # Keep application-specific templates in the app's namespace
spec:
  provider:
    type: prometheus
    address: http://prometheus-operated.monitoring.svc.cluster.local:9090
  query: |-
    (
      sum(rate(app_checkout_events_total{
        namespace="{{ namespace }}",
        pod=~"^{{ target }}.*$",
        status="success"
      }[{{ interval }}]))
      /
      sum(rate(app_checkout_events_total{
        namespace="{{ namespace }}",
        pod=~"^{{ target }}.*$"
      }[{{ interval }}]))
    ) * 100
Query Dissection:
This query calculates the ratio of successful checkouts to total checkouts and multiplies by 100 to get a percentage. The key is pod=~"^{{ target }}.*$" which correctly selects the pods belonging to either the primary or canary deployment based on their naming convention.
By referencing this template in our Canary object, we've now tied our deployment's safety directly to a critical business outcome. The thresholdRange: { min: 99.5 } in the Canary spec means that the measured checkout success rate must stay at or above 99.5% throughout the analysis. Set that floor just below the primary's normal success rate and the check effectively becomes a baseline guard: the point isn't hitting an arbitrary absolute number, it's ensuring the canary is no worse than what users already get.
Implementation Walkthrough: A Failed Deployment
Let's simulate a scenario where we deploy a new version of our podinfo service that introduces a 600ms latency delay.
1. Initial State:
* Deployment podinfo is running app:v1.0.0.
* Service podinfo routes to this deployment.
* Istio VirtualService podinfo has a single route to the podinfo service subset.
* Our Canary podinfo and MetricTemplates are applied.
2. Triggering the Deployment:
We update the podinfo Deployment's container image to app:v2.0.0 (our faulty version).
kubectl set image deployment/podinfo podinfo=ghcr.io/stefanprodan/podinfo:6.0.1 (assuming 6.0.1 is the bad version)
3. Flagger Takes Control:
Flagger detects the change. Looking at the Flagger logs (kubectl logs -f deployment/flagger -n flagger), we'll see the process unfold:
INFO podinfo.test - New revision detected! Scaling up podinfo.test
INFO podinfo.test - Starting canary analysis for podinfo.test
INFO podinfo.test - Pre-rollout check acceptance-test passed
INFO podinfo.test - Advance podinfo.test canary weight 10
Flagger has created podinfo-canary and updated the VirtualService to send 10% of traffic to it.
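To make the traffic split concrete, the VirtualService that Flagger manages now looks roughly like the fragment below. This is a simplified sketch using the article's service names; the resource Flagger actually generates also carries host and gateway settings derived from the Canary service spec.
# Simplified view of the Flagger-managed VirtualService at 10% canary weight
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  hosts:
    - podinfo
  http:
    - route:
        - destination:
            host: podinfo # primary
          weight: 90
        - destination:
            host: podinfo-canary # canary
          weight: 10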
4. Analysis Begins:
After the first interval (1 minute), Flagger runs its metric checks.
INFO podinfo.test - Running analysis for podinfo.test
INFO podinfo.test - request-success-rate: 100.00 >= 99
WARN podinfo.test - request-duration: 615.45 > 500
WARN podinfo.test - Canary podinfo.test failed checks: request-duration
The request-duration check, using our p99 latency template, has failed. The measured latency (615ms) is above our thresholdRange.max of 500ms.
5. Iterative Failures and Rollback:
Flagger will continue the analysis. Since the high latency is persistent, the check will fail repeatedly. The threshold is set to 10. After 10 consecutive failures:
... (9 more failure logs)
ERROR podinfo.test - Canary failed! Scaling down podinfo.test
INFO podinfo.test - Rolling back podinfo.test failed checks threshold reached 10
INFO podinfo.test - Halting podinfo.test advancement
INFO podinfo.test - Promotion failed!
Flagger immediately performs the rollback:
* It reconfigures the Istio VirtualService to send 100% of traffic back to the podinfo primary Deployment.
* It scales the podinfo-canary Deployment down to zero replicas.
The entire incident was contained. Only a small percentage of users were exposed to the degraded performance, and the system automatically healed itself without any human intervention.
Advanced Patterns and Production Edge Cases
While the core loop is powerful, real-world systems present more complexity.
Edge Case 1: Coordinating with Database Migrations
A common challenge is deploying an application change that depends on a database schema migration. A naive canary deployment can be catastrophic: the new v2 code expects a new column that the schema may not have yet, while 90% of traffic is still served by v1 code, which in turn will break if existing columns are removed or altered underneath it.
Solution: Multi-Stage Release with Feature Flags and Flagger Webhooks
This requires an expand/contract pattern, orchestrated by your CI/CD system and Flagger:
1. Phase 1: Deploy the v2 code, feature-flagged off. Deploy the new application code (v2) but have it behave identically to v1 via a feature flag; the new code paths are present but disabled. Run a full Flagger progressive delivery. The goal is to verify that the new binary is stable with the old schema.
2. Phase 2: Run the expansive schema migration (e.g., add the new column). The v1 code must ignore it and continue to function.
3. Phase 3: Deploy the v3 Deployment. Deploy a new version (v3, which is identical to v2 but with the feature flag enabled by default via configuration). Trigger another Flagger progressive delivery. During this phase, Flagger will be analyzing the performance of the new code paths interacting with the new schema.
4. Phase 4: Contract. Once v3 is fully rolled out and stable, you can deploy a v4 that removes the old code paths and the feature flag logic. In a separate maintenance window, run a contractive migration to remove the old database columns.
Flagger's webhooks can be used to gate these phases. For example, a post-rollout webhook from Phase 1 could trigger the CI job for Phase 2, as sketched below.
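As a minimal sketch of that gating, the Phase 1 Canary could carry a post-rollout webhook that notifies the CI system once the rollout has fully promoted; the URL here is a hypothetical CI trigger endpoint, not something defined elsewhere in this article.
# Fragment of the Phase 1 Canary spec (illustrative)
analysis:
  webhooks:
    - name: "trigger-expand-migration"
      type: post-rollout
      # Hypothetical CI endpoint that starts the Phase 2 schema migration job
      url: http://ci-trigger.ci.svc.cluster.local/jobs/expand-schema
      timeout: 30s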
Edge Case 2: A/B Testing vs. Canary Analysis
Sometimes you want to split traffic for A/B testing, not for safety analysis. For example, you want to send all users from Canada to a new version to test a feature.
Solution: Header-Based Routing with Istio
Flagger supports this directly. You can add header match conditions (and, optionally, traffic mirroring) to the analysis section of your Canary spec.
# A/B testing spec fragment
spec:
  analysis:
    # ... standard canary analysis (interval, threshold, metrics) ...
    # A/B test routing: run the analysis a fixed number of times
    iterations: 10
    match:
      - headers:
          x-user-geo:
            exact: "canada"
    # Optional: mirror live traffic to the canary for a dark launch
    # mirror: true
    # mirrorWeight: 100
In this setup, Flagger configures the VirtualService to route any request carrying the header x-user-geo: canada to the canary, while all other traffic continues to hit the primary. Because iterations is set, Flagger runs the analysis a fixed number of times against that matched traffic rather than ramping weights with stepWeight. The mirror option can additionally send a copy of live traffic to the canary for dark launching. This allows you to perform targeted A/B tests within the same safe, progressive delivery framework.
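Under the hood, the routing rule Flagger programs for this resembles the VirtualService fragment below; it is illustrative only, since Flagger generates and owns the real resource.
# Conceptual Istio routing for the A/B test
http:
  - match:
      - headers:
          x-user-geo:
            exact: "canada"
    route:
      - destination:
          host: podinfo-canary
  - route:
      - destination:
          host: podinfo # everyone else stays on the primary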
Edge Case 3: Handling Noisy or Low-Volume Metrics
For a low-traffic service, a single failed request can drop the success rate from 100% to 50%, causing a false-positive rollback. Similarly, latency can be spiky.
Solutions:
* Lengthen analysis.interval: a longer interval (e.g., 5m) provides a larger sample size, smoothing out the data.
* Adjust the MetricTemplate to use functions like avg_over_time() to smooth out spiky metrics.
* Raise analysis.threshold: a higher threshold requires the metric to fail consistently for a longer period before rollback, making the system less sensitive to transient blips.
* Use a pre-rollout webhook to trigger a load test. This guarantees a minimum level of traffic to the canary, ensuring you have enough data for a statistically significant analysis. A combined tuning sketch follows this list.
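Putting a few of these together, a tuned analysis block for a low-traffic service might look like the fragment below. This is a sketch, not a prescription: it assumes the flagger-loadtester service from the earlier webhook example is available and that its image ships the hey load generator, and it uses a rollout-type webhook (rather than pre-rollout) so synthetic traffic flows during every analysis interval.
# Illustrative analysis tuning for a low-traffic service
analysis:
  # Larger window per check = bigger, smoother sample
  interval: 5m
  # Tolerate several transient blips before rolling back
  threshold: 10
  stepWeight: 20
  maxWeight: 60
  metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 5m
  webhooks:
    # Generate synthetic traffic throughout the analysis so the
    # metrics have a statistically meaningful sample size
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 5m -q 10 -c 2 http://podinfo-canary.test:9898/"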
Performance and Resource Considerations
This powerful automation does not come for free.
* Istio Sidecar Overhead: The istio-proxy sidecar injected into each of your application pods consumes CPU and memory. This is typically in the range of 0.1-0.5 vCPU and 50-100MB RAM per pod under load. The impact on latency is usually low (a few milliseconds) but should be measured for your specific workload.
* Prometheus Footprint: A production-grade Prometheus setup can be resource-intensive, particularly in terms of memory and disk I/O, as it scales with metric cardinality and scrape frequency.
* Flagger Controller: Flagger itself is lightweight, typically consuming minimal resources.
For most modern applications, the safety and automation benefits far outweigh the resource cost. The key is to properly capacity-plan your cluster and monitor the monitoring stack itself.
Conclusion: From Release Engineering to Release Science
Implementing a progressive delivery system with Flagger, Istio, and Prometheus is a significant step in maturing a DevOps practice. It shifts the release process from a manual, anxiety-ridden engineering task to an automated, data-driven scientific process. By defining success criteria as a set of observable, quantifiable metrics, you create a resilient, self-healing deployment pipeline that catches not only catastrophic failures but also subtle performance regressions before they impact the majority of your users.
This approach requires a deeper investment in observability and a more declarative mindset, but the payoff is faster, safer, and more frequent releases—a cornerstone of high-performing engineering organizations.