Stateful Canary Deployments in Istio with Flagger and Prometheus

Goh Ling Yong

The Senior Engineer's Dilemma: The Stateful Canary

As engineers, we've mastered the art of canary releasing stateless applications. We use service meshes like Istio to declaratively shift a small percentage of traffic to a new version, monitor dashboards in Grafana, and confidently promote or roll back. It's a cornerstone of modern CI/CD. But this confidence shatters the moment a service is tightly coupled to a persistent data store undergoing a schema change.

Consider a user-profile-service backed by a PostgreSQL database. The product team wants to add a mandatory preferred_language field. A naive canary deployment—splitting traffic 90/10 between v1 (unaware of the new field) and v2 (expecting the new field)—is a recipe for production failure:

  • The Write-Path Catastrophe: If the database schema is updated to include preferred_language NOT NULL, any INSERT from the v1 service instance will fail immediately, triggering a cascade of errors.
  • The "Split-Brain" Database: If the column is added as NULLABLE to accommodate v1, the v2 service might read a user record written by v1, find a NULL where it expects a value, and throw a NullPointerException or exhibit other undefined behavior.
  • The Irreversible Rollback: If the canary v2 writes data in a new format and is then rolled back, the primary v1 instances may be unable to parse this new data, leading to a full-blown outage even after the rollback is complete.

This is not a theoretical problem. It's a high-stakes challenge that keeps SREs and senior developers up at night. The solution requires moving beyond simple traffic management and architecting a holistic deployment strategy that synchronizes application logic, traffic routing, and database state transitions. This article details a production-proven pattern using Istio, Flagger, and Prometheus to execute safe, automated canary deployments for stateful services.

    We will assume you have a working knowledge of Kubernetes, Istio's VirtualService and DestinationRule, and the basic concepts of Flagger and Prometheus. We're here to connect the dots in a non-trivial, production-oriented way.

    The Architecture: A Multi-Phase, Backwards-Compatible Approach

    To de-risk the deployment, we must break the atomic coupling between the application update and the database schema change. We'll adopt a multi-phase approach rooted in the principle of backwards-compatible changes. This is often referred to as the "Expand/Contract" pattern.

    Our goal is to add a preferred_language column to the users table. Instead of a single, big-bang deployment, we'll use three distinct phases:

  • Phase 1: The Preparatory Deployment (v1.1 - The "Expand" Phase)

    • Application Logic: Deploy a new version of the service, v1.1. This version is designed to be tolerant of the upcoming schema change: it can read records with or without the preferred_language field and will write NULL or a default value when the data is not provided. Crucially, it does not yet depend on the new field for its core logic.

    • Database Migration: Before v1.1 is fully deployed, run a migration script that adds the preferred_language column as NULLABLE (a minimal sketch of this migration follows the list).

    • Outcome: The entire cluster is now running v1.1 on a schema that supports both old and new data formats. The system is stable and ready for the real change.

  • Phase 2: The Canary Deployment (v2.0 - The "Live" Phase)

    • Application Logic: This is our target version, v2.0. It actively reads and writes the preferred_language field and contains the new business logic that depends on it.

    • Automated Canary Analysis (Flagger): This is the core of our strategy. Flagger will gradually shift traffic to v2.0 while monitoring a set of metrics: standard Istio metrics (success rate, latency) and critical custom application metrics that validate data consistency.

    • Outcome: If the analysis succeeds, Flagger promotes v2.0 to be the new primary, and all traffic is routed to it. If it fails, traffic is immediately routed back to the stable v1.1 primary.

  • Phase 3: The Cleanup Deployment (v2.1 - The "Contract" Phase)

    • Application Logic: A minor cleanup release, v2.1, removes the backwards-compatibility code (e.g., handling NULL values for the new field).

    • Database Migration: A post-deployment script backfills any remaining NULL values and then alters the preferred_language column to NOT NULL, enforcing the schema constraint permanently.

    • Outcome: The system is now in its final, clean state.
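
    For concreteness, here is a minimal sketch of the Phase 1 "expand" migration, written as a small Go program against database/sql and lib/pq (the same driver the service uses later in this article). The DATABASE_URL environment variable and the IF NOT EXISTS guard are illustrative assumptions, not prescribed tooling — most teams would ship the equivalent SQL through their existing migration framework.

    go
    // migrate_expand.go — Phase 1 "expand" migration sketch (illustrative, hypothetical helper).
    package main
    
    import (
    	"database/sql"
    	"log"
    	"os"
    
    	_ "github.com/lib/pq"
    )
    
    func main() {
    	// DATABASE_URL is an assumed environment variable holding a standard Postgres DSN.
    	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
    	if err != nil {
    		log.Fatalf("open db: %v", err)
    	}
    	defer db.Close()
    
    	// Expand phase: add the column as NULLABLE so v1.x writers keep working.
    	// IF NOT EXISTS makes the migration idempotent (safe to re-run after a failed attempt).
    	if _, err := db.Exec(`ALTER TABLE users ADD COLUMN IF NOT EXISTS preferred_language TEXT`); err != nil {
    		log.Fatalf("expand migration failed: %v", err)
    	}
    	log.Println("expand migration applied: users.preferred_language added as nullable")
    }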

    This article will focus intensely on implementing Phase 2, as it contains the most complex automation and is the highest-risk step.

    Deep Dive: Implementing the Stateful Canary with Flagger

    Let's build out the scenario. We have a Go-based user-profile-service with a v1.1 version deployed and running stably in our Kubernetes cluster.

    Step 1: Instrumenting the Application for State-Aware Metrics

    Standard metrics like http_requests_total are insufficient. We need to know if the canary is causing data-related problems. We must instrument our application to expose custom Prometheus metrics. In our Go service, we'll use the prometheus/client_golang library.

    go
    // main.go (simplified)
    package main
    
    import (
    	"database/sql"
    	"encoding/json"
    	"log"
    	"net/http"
    
    	_ "github.com/lib/pq"
    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var (
    	db *sql.DB
    
    	// Custom Prometheus metric: Counter for when we read a user record
    	// written by an older version of the service (i.e., preferred_language is NULL).
    	legacyRecordReads = prometheus.NewCounter(
    		prometheus.CounterOpts{
    			Name: "user_profile_legacy_record_reads_total",
    			Help: "Total number of user records read that were written by a pre-v2 service version.",
    		},
    	)
    
    	// Custom Prometheus metric: Counter for write errors specifically related to our new field.
    	// This is more specific and useful than a generic DB error counter.
    	schemaWriteErrors = prometheus.NewCounterVec(
    		prometheus.CounterOpts{
    			Name: "user_profile_schema_write_errors_total",
    			Help: "Total number of database write errors related to schema mismatches.",
    		},
    		[]string{"error_type"},
    	)
    )
    
    func init() {
    	prometheus.MustRegister(legacyRecordReads)
    	prometheus.MustRegister(schemaWriteErrors)
    }
    
    type User struct {
    	ID               int    `json:"id"`
    	Username         string `json:"username"`
    	PreferredLanguage string `json:"preferred_language"` // v2.0 uses this
    }
    
    // In our v2.0 application logic:
    func GetUserHandler(w http.ResponseWriter, r *http.Request) {
    	// Extract the user ID from the request; a query parameter is assumed in this simplified example.
    	userID := r.URL.Query().Get("id")
    	var user User
    	var lang sql.NullString // Use sql.NullString for backwards compatibility
    
    	row := db.QueryRow("SELECT id, username, preferred_language FROM users WHERE id = $1", userID)
    	err := row.Scan(&user.ID, &user.Username, &lang)
    	if err != nil {
    		// Handle error
    		return
    	}
    
    	if !lang.Valid {
    		// This record was written by v1.0 or v1.1. It's a legacy record.
    		legacyRecordReads.Inc()
    		user.PreferredLanguage = "en-US" // Apply a default
    	} else {
    		user.PreferredLanguage = lang.String
    	}
    
    	json.NewEncoder(w).Encode(user)
    }
    
    func main() {
    	// ... database connection setup ...
    
    	http.Handle("/metrics", promhttp.Handler())
    	http.HandleFunc("/user", GetUserHandler)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }

    This instrumentation gives us two crucial signals:

  • user_profile_legacy_record_reads_total: During the canary rollout, we expect the rate of legacy reads to fall as v2.0 writes new records in the new format. A flat or increasing rate could signal a problem.
  • user_profile_schema_write_errors_total: This should be zero. Any increase is a critical failure signal.

    Step 2: Defining the Flagger Canary Resource

    Now we create the Canary Custom Resource Definition (CRD) that tells Flagger how to manage the deployment of user-profile-service. This YAML is the heart of our automated strategy.

    yaml
    # canary-user-profile.yaml
    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: user-profile-service
      namespace: production
    spec:
      # Reference to the Kubernetes Deployment we are managing
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: user-profile-service
    
      # Reference to the Service that points to the Deployment
      service:
        port: 80
        targetPort: 8080 # The port our Go app listens on
    
      # The core of the canary analysis
      analysis:
        # Analysis runs every 30 seconds
        interval: 30s
        # Max number of failed metric checks before the canary is rolled back
        threshold: 5
        stepWeight: 10 # Increase traffic by 10% each step
        maxWeight: 50  # Cap canary traffic at 50%
    
        # METRICS: Standard and Custom
        metrics:
          # Standard Istio Metric: Request success rate must be over 99%
          - name: request-success-rate
            thresholdRange:
              min: 99
            interval: 1m
    
          # Standard Istio Metric: 99th percentile request latency must be under 500ms
          - name: request-duration
            thresholdRange:
              max: 500
            interval: 1m
    
          # CUSTOM METRIC 1: Check for schema-related write errors.
          # This is a hard gate. If this query returns any result, the canary fails.
          - name: db-schema-write-errors
            templateRef:
              name: db-schema-errors-check
              namespace: flagger-system
            thresholdRange:
              max: 0 # Must be zero
            interval: 1m
    
          # CUSTOM METRIC 2: Ensure the ratio of legacy reads is decreasing.
          # This query checks if the rate of old-format records being read is not growing.
          - name: legacy-read-ratio
            templateRef:
              name: legacy-read-ratio-check
              namespace: flagger-system
            # We expect the ratio to be at most 1 (i.e., the canary is not reading more legacy records than the primary)
            thresholdRange:
              max: 1 
            interval: 1m
    
        # WEBHOOKS: For pre/post-rollout actions
        webhooks:
          # Before shifting any traffic, run a check.
          - name: "confirm-db-migration-v2"
            type: pre-rollout
            url: http://gatekeeper.ops-tools.svc.cluster.local/check-schema
            timeout: 30s
            metadata:
              service: "user-profile-service"
              schema_version: "2.0"
    
          # After successful promotion, we could trigger a cleanup job.
          - name: "trigger-schema-contract-job"
            type: post-rollout
            url: http://job-runner.ops-tools.svc.cluster.local/run
            timeout: 1m
            metadata:
              job: "user-profile-schema-contract"

    Step 3: Defining Custom Metric Templates

    The Canary resource references MetricTemplates. These are reusable templates that contain the actual PromQL queries. This separation keeps the Canary definition clean.

    First, the template for our critical schema error check:

    yaml
    # metric-template-db-errors.yaml
    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: db-schema-errors-check
      namespace: flagger-system
    spec:
      provider:
        type: prometheus
        address: http://prometheus.istio-system.svc.cluster.local:9090
    
      # This query sums the rate of our custom schema write error counter
      # ONLY for the pods belonging to the canary deployment.
      # Flagger replaces {{ namespace }}, {{ target }}, etc. at runtime.
      query: >
        sum(rate(user_profile_schema_write_errors_total{ 
          namespace="{{ namespace }}", 
          pod=~"^{{ target }}.*" 
        }[1m]))

    Next, the more complex query to check the ratio of legacy reads. We want to ensure the canary isn't reading more legacy records per second than the primary. A stable or decreasing ratio is healthy.

    yaml
    # metric-template-legacy-reads.yaml
    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: legacy-read-ratio-check
      namespace: flagger-system
    spec:
      provider:
        type: prometheus
        address: http://prometheus.istio-system.svc.cluster.local:9090
    
      query: >
        (
          sum(rate(user_profile_legacy_record_reads_total{ 
            namespace="{{ namespace }}", 
            pod=~"^{{ target }}.*" 
          }[1m]))
          /
          sum(rate(user_profile_legacy_record_reads_total{ 
            namespace="{{ namespace }}", 
            pod=~"^{{ primary }}.*" 
          }[1m]))
        ) or on() vector(0)

    Note on the PromQL: The or on() vector(0) clause is crucial. If the primary has recorded no legacy reads, its series may be absent from Prometheus and the division then returns no sample at all; Flagger treats an empty query result as a failed check. The clause defaults the result to 0 in that case, preventing the canary from being failed spuriously.

    Step 4: The Pre-Rollout Webhook Gate

    The pre-rollout webhook is our final safety gate. Before Flagger shifts a single packet of user traffic, it will call this webhook. We can build a simple internal service (gatekeeper) that checks the database's internal schema_migrations table (or equivalent) to confirm that the required migration for v2.0 has been successfully applied.

  • If the gatekeeper service responds with 200 OK, Flagger proceeds with the canary analysis.
  • If it responds with any non-2xx status code, Flagger halts the deployment and marks it as failed. This prevents the v2.0 application code from ever receiving user traffic against an incorrect database schema.
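
    Below is a minimal sketch of what such a gatekeeper could look like in Go. It assumes a PostgreSQL schema_migrations table with one row per applied version, a DATABASE_URL environment variable, and that the webhook payload carries the metadata map defined in the Canary spec — all of which are illustrative assumptions rather than a prescribed implementation.

    go
    // gatekeeper.go — pre-rollout schema gate (illustrative sketch).
    package main
    
    import (
    	"database/sql"
    	"encoding/json"
    	"log"
    	"net/http"
    	"os"
    
    	_ "github.com/lib/pq"
    )
    
    // webhookPayload models the JSON Flagger posts to webhooks (only the fields we need here).
    type webhookPayload struct {
    	Name      string            `json:"name"`
    	Namespace string            `json:"namespace"`
    	Metadata  map[string]string `json:"metadata"`
    }
    
    var db *sql.DB
    
    func checkSchema(w http.ResponseWriter, r *http.Request) {
    	var payload webhookPayload
    	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
    		http.Error(w, "bad payload", http.StatusBadRequest)
    		return
    	}
    
    	// The Canary spec passes schema_version in the webhook metadata.
    	wanted := payload.Metadata["schema_version"]
    
    	// Assumed migrations bookkeeping: one row per applied migration version.
    	var applied bool
    	err := db.QueryRow(
    		"SELECT EXISTS (SELECT 1 FROM schema_migrations WHERE version = $1)", wanted,
    	).Scan(&applied)
    	if err != nil || !applied {
    		// A non-2xx response tells Flagger to halt the rollout before any traffic is shifted.
    		http.Error(w, "required schema migration not applied", http.StatusPreconditionFailed)
    		return
    	}
    	w.WriteHeader(http.StatusOK)
    }
    
    func main() {
    	var err error
    	db, err = sql.Open("postgres", os.Getenv("DATABASE_URL")) // assumed connection string
    	if err != nil {
    		log.Fatalf("open db: %v", err)
    	}
    	http.HandleFunc("/check-schema", checkSchema)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }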

    Visualizing the Process

    When a new image for user-profile-service is pushed, the process unfolds automatically:

  • The CI/CD pipeline updates the user-profile-service Deployment spec with the new image tag (v2.0).
  • Flagger's controller detects the change and scales up the canary pods from the updated Deployment; the user-profile-service-primary Deployment it manages continues serving stable traffic.
  • Pre-Rollout Check: Flagger calls the gatekeeper webhook. The gatekeeper checks the DB schema version. Let's assume it passes.
  • Flagger modifies the Istio VirtualService to send 10% of traffic to the canary pods.
  • Analysis Loop (every 30s):
    • Flagger queries Prometheus for success rate and latency.
    • Flagger runs the db-schema-write-errors query. The result must be 0.
    • Flagger runs the legacy-read-ratio query. The result must be <= 1.
    • If all metrics are within their thresholds, Flagger increases the traffic weight by another 10%; each failed check counts toward the rollback threshold of 5.
    • This loop continues until the traffic weight reaches 50%.
  • After the analysis passes at 50%, Flagger promotes the canary: the v2.0 spec is copied to the primary Deployment, the primary rolls out the new version, the VirtualService is reset to send 100% of traffic to the new primary, and the canary pods are scaled down.
  • Post-Rollout Action: Flagger calls the post-rollout webhook, which could trigger a Jenkins job or Argo Workflow to run the schema cleanup/contraction tasks (a sketch of such a job follows this list).
  • If at any point the failure threshold is breached, Flagger immediately aborts the analysis, resets the VirtualService to send 100% of traffic back to the v1.1 primary, and scales down the canary pods. The deployment has failed safely.
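
    To make the post-rollout hook concrete, here is a minimal sketch of what the user-profile-schema-contract job could run. The 'en-US' backfill default mirrors the fallback used in the service code above; the single-statement backfill and DATABASE_URL wiring are simplifying assumptions — on a large table you would backfill in batches or via an online schema-change tool, as discussed in the next section.

    go
    // contract_job.go — Phase 3 "contract" migration sketch (illustrative, not a prescribed job).
    package main
    
    import (
    	"database/sql"
    	"log"
    	"os"
    
    	_ "github.com/lib/pq"
    )
    
    func main() {
    	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL")) // assumed connection string
    	if err != nil {
    		log.Fatalf("open db: %v", err)
    	}
    	defer db.Close()
    
    	// Backfill any remaining legacy rows before tightening the constraint.
    	// 'en-US' matches the default applied by the service's read path.
    	if _, err := db.Exec(`UPDATE users SET preferred_language = 'en-US' WHERE preferred_language IS NULL`); err != nil {
    		log.Fatalf("backfill failed: %v", err)
    	}
    
    	// Contract phase: enforce the constraint permanently. Re-running this once the
    	// column is already NOT NULL is harmless.
    	if _, err := db.Exec(`ALTER TABLE users ALTER COLUMN preferred_language SET NOT NULL`); err != nil {
    		log.Fatalf("contract migration failed: %v", err)
    	}
    	log.Println("contract migration applied: users.preferred_language is now NOT NULL")
    }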

    Edge Cases and Advanced Considerations

    This pattern is robust, but senior engineers must consider the boundaries.

  • Handling Long-Running Migrations: The pre-rollout check assumes the migration is already complete. For massive tables where a migration might take hours, you cannot run it synchronously within a CI/CD pipeline. In these cases, the migration should be triggered out-of-band using a dedicated tool (e.g., gh-ost, pt-online-schema-change). The webhook's role remains the same: to act as a gate, verifying completion, not execution.

  • Idempotent and Reversible Migrations: What if the canary fails and rolls back? The schema has still been changed. All migration scripts must be designed to be idempotent (running them multiple times has the same effect as running them once) and, ideally, reversible (see the sketch after this list). The backwards-compatible application code (v1.1) ensures that the system remains stable even if the canary v2.0 is rolled back, as it can tolerate the new (but unused) schema.

  • Transactional Guarantees: This pattern does not magically solve transactional issues during the traffic split. If v1.1 and v2.0 both modify the same row in a single user transaction, you can still have race conditions. The design of the application logic during this transitional phase is critical. Operations should be structured to be as commutative and conflict-free as possible.

  • Read-Your-Own-Writes Consistency: During the canary, a user's request might be served by v2.0 on write, but a subsequent read milliseconds later could be routed to v1.1. If v1.1 cannot interpret the data written by v2.0, the user sees an error. This is another reason why the v1.1 primary must be fully forwards-compatible with data written by the canary.
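
    As a concrete illustration of the reversibility point, the expand migration sketched earlier can be paired with a guarded "down" step like the one below. This is an illustrative assumption rather than a required practice: dropping the column discards anything the canary wrote to it, so in most rollbacks you would simply leave the expanded (but unused) schema in place, precisely because v1.1 tolerates it.

    go
    // migrate_expand_down.go — reversible counterpart to the expand migration (illustrative sketch).
    package main
    
    import (
    	"database/sql"
    	"log"
    	"os"
    
    	_ "github.com/lib/pq"
    )
    
    func main() {
    	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL")) // assumed connection string
    	if err != nil {
    		log.Fatalf("open db: %v", err)
    	}
    	defer db.Close()
    
    	// IF EXISTS makes the rollback idempotent. Dropping the column is only safe while
    	// no running version depends on it (i.e., before the contract phase), and it discards
    	// any preferred_language values written during the canary window.
    	if _, err := db.Exec(`ALTER TABLE users DROP COLUMN IF EXISTS preferred_language`); err != nil {
    		log.Fatalf("down migration failed: %v", err)
    	}
    	log.Println("down migration applied: users.preferred_language removed")
    }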

    Conclusion: From Risk to Repeatability

    Deploying stateful services is inherently more complex than their stateless counterparts. By abandoning the monolithic deploy-and-migrate approach in favor of a multi-phase, automated strategy, we transform a high-risk, manual process into a repeatable, observable, and safe engineering practice.

    The combination of Istio's precise traffic control, Flagger's progressive delivery automation, and Prometheus's deep, queryable observability provides the necessary toolkit. The key, however, is not the tools themselves, but the architectural pattern: instrumenting your application with state-aware metrics, designing for backwards compatibility, and using automated gates like webhooks to enforce preconditions. This approach allows development teams to move faster, not by cutting corners, but by building a sophisticated safety net that understands the intricate dance between code and data.
