Stateful Canary Deployments in Istio with Flagger and Prometheus
The Senior Engineer's Dilemma: The Stateful Canary
As engineers, we've mastered the art of canary releasing stateless applications. We use service meshes like Istio to declaratively shift a small percentage of traffic to a new version, monitor dashboards in Grafana, and confidently promote or roll back. It's a cornerstone of modern CI/CD. But this confidence shatters the moment a service is tightly coupled to a persistent data store undergoing a schema change.
Consider a user-profile-service backed by a PostgreSQL database. The product team wants to add a mandatory preferred_language field. A naive canary deployment—splitting traffic 90/10 between v1 (unaware of the new field) and v2 (expecting the new field)—is a recipe for production failure:
* If the migration adds preferred_language as NOT NULL, any INSERT from the v1 service instance will fail immediately, triggering a cascade of errors.
* If the column is added as NULLABLE to accommodate v1, the v2 service might read a user record written by v1, find a NULL where it expects a value, and throw a NullPointerException or exhibit other undefined behavior.
* If v2 writes data in a new format and is then rolled back, the primary v1 instances may be unable to parse this new data, leading to a full-blown outage even after the rollback is complete.

This is not a theoretical problem. It's a high-stakes challenge that keeps SREs and senior developers up at night. The solution requires moving beyond simple traffic management and architecting a holistic deployment strategy that synchronizes application logic, traffic routing, and database state transitions. This article details a production-proven pattern using Istio, Flagger, and Prometheus to execute safe, automated canary deployments for stateful services.
We will assume you have a working knowledge of Kubernetes, Istio's VirtualService and DestinationRule, and the basic concepts of Flagger and Prometheus. We're here to connect the dots in a non-trivial, production-oriented way.
The Architecture: A Multi-Phase, Backwards-Compatible Approach
To de-risk the deployment, we must break the atomic coupling between the application update and the database schema change. We'll adopt a multi-phase approach rooted in the principle of backwards-compatible changes. This is often referred to as the "Expand/Contract" pattern.
Our goal is to add a preferred_language column to the users table. Instead of a single, big-bang deployment, we'll use three distinct phases:
Phase 1: Expand

* Application Logic: Deploy a new version of the service, v1.1. This version is designed to be tolerant of the upcoming schema. It can read records with or without the preferred_language field and will write NULL or a default value to it if the data is not provided. Crucially, it does not yet depend on the new field for its core logic.
* Database Migration: Before v1.1 is fully deployed, run a migration script that adds the preferred_language column as NULLABLE (a minimal sketch of both migration steps follows this phase list).
* Outcome: The entire cluster is now running v1.1 on a schema that supports both old and new data formats. The system is stable and ready for the real change.

Phase 2: Canary Release

* Application Logic: This is our target version, v2.0. It actively reads and writes the preferred_language field and contains the new business logic that depends on it.
* Automated Canary Analysis (Flagger): This is the core of our strategy. Flagger will gradually shift traffic to v2.0 while monitoring a set of metrics. This set includes standard Istio metrics (success rate, latency) and critical custom application metrics that validate data consistency.
* Outcome: If the analysis succeeds, Flagger promotes v2.0 to be the new primary, and all traffic is routed to it. If it fails, traffic is immediately routed back to the stable v1.1 primary.

Phase 3: Contract

* Application Logic: A minor cleanup release, v2.1, can be deployed to remove the backwards-compatibility code (e.g., handling NULL values for the new field).
* Database Migration: A post-deployment script backfills any remaining NULL values and then alters the preferred_language column to be NOT NULL. This enforces the schema constraint permanently.
* Outcome: The system is now in its final, clean state.
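To make the migration steps concrete, here is a minimal sketch of the expand and contract migrations using Go's database/sql against PostgreSQL. The runExpand and runContract helpers, the connection string, and the migration file layout are illustrative assumptions rather than part of a specific migration framework; the en-US backfill default matches the fallback the service itself applies. In practice you would run these through your migration tooling of choice.

// migrations.go (illustrative sketch, not a full migration framework)
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// runExpand implements the Phase 1 "expand" step: add the column as nullable.
// IF NOT EXISTS keeps the statement idempotent, so re-running it is safe.
func runExpand(db *sql.DB) error {
	_, err := db.Exec(`ALTER TABLE users ADD COLUMN IF NOT EXISTS preferred_language VARCHAR(16)`)
	return err
}

// runContract implements the Phase 3 "contract" step: backfill remaining NULLs,
// then enforce the constraint. Run this only after v2.0 is fully promoted.
func runContract(db *sql.DB) error {
	if _, err := db.Exec(`UPDATE users SET preferred_language = 'en-US' WHERE preferred_language IS NULL`); err != nil {
		return err
	}
	_, err := db.Exec(`ALTER TABLE users ALTER COLUMN preferred_language SET NOT NULL`)
	return err
}

func main() {
	// The DSN is an assumption for illustration; use your own connection settings.
	db, err := sql.Open("postgres", "postgres://app:secret@db:5432/users?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := runExpand(db); err != nil {
		log.Fatalf("expand migration failed: %v", err)
	}
}

The ordering is the entire point: the expand step runs before any v2.0 pod exists, and the contract step runs only after promotion.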
This article will focus intensely on implementing Phase 2, as it contains the most complex automation and is the highest-risk step.
Deep Dive: Implementing the Stateful Canary with Flagger
Let's build out the scenario. We have a Go-based user-profile-service with a v1.1 version deployed and running stably in our Kubernetes cluster.
Step 1: Instrumenting the Application for State-Aware Metrics
Standard metrics like http_requests_total are insufficient. We need to know if the canary is causing data-related problems. We must instrument our application to expose custom Prometheus metrics. In our Go service, we'll use the prometheus/client_golang library.
// main.go (simplified)
package main
import (
"database/sql"
"encoding/json"
"log"
"net/http"
_ "github.com/lib/pq"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
db *sql.DB
// Custom Prometheus metric: Counter for when we read a user record
// written by an older version of the service (i.e., preferred_language is NULL).
legacyRecordReads = prometheus.NewCounter(
prometheus.CounterOpts{
Name: "user_profile_legacy_record_reads_total",
Help: "Total number of user records read that were written by a pre-v2 service version.",
},
)
// Custom Prometheus metric: Counter for write errors specifically related to our new field.
// This is more specific and useful than a generic DB error counter.
schemaWriteErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "user_profile_schema_write_errors_total",
Help: "Total number of database write errors related to schema mismatches.",
},
[]string{"error_type"},
)
)
func init() {
prometheus.MustRegister(legacyRecordReads)
prometheus.MustRegister(schemaWriteErrors)
}
type User struct {
ID int `json:"id"`
Username string `json:"username"`
PreferredLanguage string `json:"preferred_language"` // v2.0 uses this
}
// In our v2.0 application logic:
func GetUserHandler(w http.ResponseWriter, r *http.Request) {
	// Get the user ID from the request (simplified: no validation).
	userID := r.URL.Query().Get("id")
var user User
var lang sql.NullString // Use sql.NullString for backwards compatibility
row := db.QueryRow("SELECT id, username, preferred_language FROM users WHERE id = $1", userID)
err := row.Scan(&user.ID, &user.Username, &lang)
if err != nil {
		// Simplified error handling: treat any scan failure as not found.
		http.Error(w, "user not found", http.StatusNotFound)
		return
}
if !lang.Valid {
// This record was written by v1.0 or v1.1. It's a legacy record.
legacyRecordReads.Inc()
user.PreferredLanguage = "en-US" // Apply a default
} else {
user.PreferredLanguage = lang.String
}
json.NewEncoder(w).Encode(user)
}
func main() {
// ... database connection setup ...
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/user", GetUserHandler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
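The handler above only covers the read path, so the schemaWriteErrors counter is never incremented in the snippet. Below is a hedged sketch of what the corresponding write path might look like in v2.0. The UpdateUserHandler name, the string-based error classification, and the label values are assumptions for illustration (a production service would inspect driver error codes such as lib/pq's 23502 not_null_violation), and it needs "strings" added to the import block.

// Illustrative v2.0 write path (sketch): classify schema-related failures
// so the canary analysis can gate on them.
func UpdateUserHandler(w http.ResponseWriter, r *http.Request) {
	var user User
	if err := json.NewDecoder(r.Body).Decode(&user); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	_, err := db.Exec(
		"UPDATE users SET username = $1, preferred_language = $2 WHERE id = $3",
		user.Username, user.PreferredLanguage, user.ID,
	)
	if err != nil {
		// Crude classification by error text; see the caveat above about error codes.
		msg := err.Error()
		switch {
		case strings.Contains(msg, "violates not-null constraint"):
			schemaWriteErrors.WithLabelValues("not_null_violation").Inc()
		case strings.Contains(msg, "column"):
			schemaWriteErrors.WithLabelValues("schema_mismatch").Inc()
		}
		http.Error(w, "write failed", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}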
This instrumentation gives us two crucial signals:
* user_profile_legacy_record_reads_total: During the canary rollout, we expect this number to decrease as v2.0 writes new records in the new format. A flat or increasing rate could signal a problem.
* user_profile_schema_write_errors_total: This should be zero. Any increase is a critical failure signal.

Step 2: Defining the Flagger Canary Resource
Now we create the Canary Custom Resource Definition (CRD) that tells Flagger how to manage the deployment of user-profile-service. This YAML is the heart of our automated strategy.
# canary-user-profile.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: user-profile-service
namespace: production
spec:
# Reference to the Kubernetes Deployment we are managing
targetRef:
apiVersion: apps/v1
kind: Deployment
name: user-profile-service
# Reference to the Service that points to the Deployment
service:
port: 80
targetPort: 8080 # The port our Go app listens on
# The core of the canary analysis
analysis:
# Analysis runs every 30 seconds
interval: 30s
    # Max number of failed metric checks before the canary is rolled back
    threshold: 5
    # Traffic stepping for the analysis
stepWeight: 10 # Increase traffic by 10% each step
maxWeight: 50 # Cap canary traffic at 50%
# METRICS: Standard and Custom
metrics:
# Standard Istio Metric: Request success rate must be over 99%
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
# Standard Istio Metric: 99th percentile request latency must be under 500ms
- name: request-duration
thresholdRange:
max: 500
interval: 1m
# CUSTOM METRIC 1: Check for schema-related write errors.
      # This is a hard gate: any non-zero value fails the canary.
- name: db-schema-write-errors
templateRef:
name: db-schema-errors-check
namespace: flagger-system
thresholdRange:
max: 0 # Must be zero
interval: 1m
# CUSTOM METRIC 2: Ensure the ratio of legacy reads is decreasing.
# This query checks if the rate of old-format records being read is not growing.
- name: legacy-read-ratio
templateRef:
name: legacy-read-ratio-check
namespace: flagger-system
# We expect the ratio to be less than 1 (i.e., not increasing compared to primary)
thresholdRange:
max: 1
interval: 1m
# WEBHOOKS: For pre/post-rollout actions
webhooks:
# Before shifting any traffic, run a check.
- name: "confirm-db-migration-v2"
type: pre-rollout
url: http://gatekeeper.ops-tools.svc.cluster.local/check-schema
timeout: 30s
metadata:
service: "user-profile-service"
schema_version: "2.0"
# After successful promotion, we could trigger a cleanup job.
- name: "trigger-schema-contract-job"
type: post-rollout
url: http://job-runner.ops-tools.svc.cluster.local/run
timeout: 1m
metadata:
job: "user-profile-schema-contract"
Step 3: Defining Custom Metric Templates
The Canary resource references MetricTemplates. These are reusable templates that contain the actual PromQL queries. This separation keeps the Canary definition clean.
First, the template for our critical schema error check:
# metric-template-db-errors.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: db-schema-errors-check
namespace: flagger-system
spec:
provider:
type: prometheus
address: http://prometheus.istio-system.svc.cluster.local:9090
  # This query sums the rate of our custom schema write error counter
  # ONLY for the pods belonging to the canary workload. The pod regex
  # deliberately excludes the {{ target }}-primary pods; Flagger substitutes
  # {{ namespace }} and {{ target }} at runtime.
  query: >
    sum(rate(user_profile_schema_write_errors_total{
      namespace="{{ namespace }}",
      pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
    }[1m]))
Next, the more complex query to check the ratio of legacy reads. We want to ensure the canary isn't reading more legacy records per second than the primary. A stable or decreasing ratio is healthy.
# metric-template-legacy-reads.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: legacy-read-ratio-check
namespace: flagger-system
spec:
provider:
type: prometheus
address: http://prometheus.istio-system.svc.cluster.local:9090
  query: >
    (
      sum(rate(user_profile_legacy_record_reads_total{
        namespace="{{ namespace }}",
        pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
      }[1m]))
      /
      sum(rate(user_profile_legacy_record_reads_total{
        namespace="{{ namespace }}",
        pod=~"{{ target }}-primary-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
      }[1m]))
    ) or on() vector(0)
Note on the PromQL: The or on() vector(0) clause is crucial. If the primary pods have not emitted any legacy-read samples, the denominator is an empty vector and the whole expression returns no data, which Flagger treats as a failed check. The clause defaults the result to 0 in that case, preventing spurious rollbacks. (If the primary series exists but its rate is exactly zero, the division yields NaN or +Inf instead; filtering the denominator with > 0 ahead of the or clause handles that edge as well.)
Step 4: The Pre-Rollout Webhook Gate
The pre-rollout webhook is our final safety gate. Before Flagger shifts a single packet of user traffic, it will call this webhook. We can build a simple internal service (gatekeeper) that checks the database's internal schema_migrations table (or equivalent) to confirm that the required migration for v2.0 has been successfully applied.
* If the gatekeeper service responds with 200 OK, Flagger proceeds with the canary analysis.
* If it responds with any non-2xx status, Flagger halts the rollout before shifting traffic, preventing the v2.0 application code from ever running against an incorrect database schema.
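What the gatekeeper does can be very small. Below is a hedged sketch of the /check-schema endpoint, assuming a golang-migrate-style schema_migrations table (version, dirty columns) and a made-up minimum version number; the payload struct mirrors the JSON body Flagger posts to webhooks (name, namespace, phase, metadata), and everything else (DSN environment variable, version constant, status codes) is an assumption for illustration.

// gatekeeper.go (sketch): a pre-rollout gate that verifies the expand migration landed.
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"

	_ "github.com/lib/pq"
)

var db *sql.DB

// Flagger POSTs a JSON body to webhooks; Metadata carries the key/value pairs
// from the Canary spec (here: service and schema_version).
type webhookPayload struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Phase     string            `json:"phase"`
	Metadata  map[string]string `json:"metadata"`
}

// requiredVersion is a placeholder for the migration that adds preferred_language.
const requiredVersion = 20240101

func checkSchemaHandler(w http.ResponseWriter, r *http.Request) {
	var payload webhookPayload
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	var version int64
	var dirty bool
	err := db.QueryRow(
		`SELECT version, dirty FROM schema_migrations ORDER BY version DESC LIMIT 1`,
	).Scan(&version, &dirty)
	if err != nil || dirty || version < requiredVersion {
		// Any non-2xx response makes Flagger halt the rollout before shifting traffic.
		http.Error(w, "required schema migration not applied", http.StatusPreconditionFailed)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	var err error
	// The DSN environment variable name is an assumption for this sketch.
	db, err = sql.Open("postgres", os.Getenv("GATEKEEPER_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/check-schema", checkSchemaHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}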
Visualizing the Process
When a new image for user-profile-service is pushed, the process unfolds automatically:
* The CI/CD pipeline updates the user-profile-service Deployment spec with the new image tag (v2.0).
* Flagger's controller detects this change and scales up the canary workload: the updated user-profile-service pods, exposed through the user-profile-service-canary service.
* Flagger calls the pre-rollout gatekeeper webhook. The gatekeeper checks the DB schema version. Let's assume it passes.
* Flagger updates the VirtualService to send 10% of traffic to the canary pods.
* Flagger queries Prometheus for success rate and latency.
* Flagger runs the db-schema-write-errors query. The result must be 0.
* Flagger runs the legacy-read-ratio query. The result must be <= 1.
* If all metrics stay within their thresholds, Flagger increases the traffic weight to 20% at the next interval.
* This loop continues until the traffic weight reaches 50% (maxWeight).
* With the analysis complete, Flagger promotes the canary: the primary Deployment is updated from v1.1 to v2.0, the VirtualService is reset to send 100% of traffic to the new primary, and the canary pods are scaled down.
* Flagger then calls the post-rollout webhook, which could trigger a Jenkins job or Argo Workflow to run the schema cleanup/contraction tasks.

If the metric checks fail more times than the configured threshold allows, Flagger aborts the analysis, resets the VirtualService to send 100% of traffic back to the v1.1 primary, and scales down the canary pods. The deployment has failed safely.
Edge Cases and Advanced Considerations
This pattern is robust, but senior engineers must consider the boundaries.
* Handling Long-Running Migrations: The pre-rollout check assumes the migration is already complete. For massive tables where a migration might take hours, you cannot run it synchronously within a CI/CD pipeline. In these cases, the migration should be triggered out-of-band using a dedicated tool (e.g., gh-ost, pt-online-schema-change). The webhook's role remains the same: to act as a gate, verifying completion, not execution.
* Idempotent and Reversible Migrations: What if the canary fails and rolls back? The schema has still been changed. All migration scripts must be designed to be idempotent (running them multiple times has the same effect as running them once) and, ideally, reversible. The backwards-compatible application code (v1.1) ensures that the system remains stable even if the canary v2.0 is rolled back, as it can tolerate the new (but unused) schema.
* Transactional Guarantees: This pattern does not magically solve transactional issues during the traffic split. If v1.1 and v2.0 both modify the same row in a single user transaction, you can still have race conditions. The design of the application logic during this transitional phase is critical: structure writes to be as commutative and conflict-free as possible, ideally scoped to only the columns each version owns (see the sketch after this list).
* Read-Your-Own-Writes Consistency: During the canary, a user's request might be served by v2.0 on write, but a subsequent read milliseconds later could be routed to v1.1. If v1.1 cannot interpret the data written by v2.0, the user sees an error. This is another reason why the v1.1 (primary) must be fully forwards-compatible with data written by the canary.
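As a concrete illustration of the conflict-free-writes point above, here is a minimal sketch of a v2.0 write that touches only the column it owns. The setPreferredLanguage function and its SQL are assumptions for illustration; the design choice is that a statement which never mentions username (or any other v1-owned column) cannot clobber a concurrent v1.1 write to it.

// writes.go (illustrative sketch)
package main

import "database/sql"

// setPreferredLanguage updates only the v2-owned column. NULLIF/COALESCE keep the
// existing value when no language is supplied, and because no other column appears
// in the SET list, concurrent v1.1 updates to those columns are never overwritten.
func setPreferredLanguage(db *sql.DB, userID int, lang string) error {
	_, err := db.Exec(
		`UPDATE users
		    SET preferred_language = COALESCE(NULLIF($2, ''), preferred_language)
		  WHERE id = $1`,
		userID, lang,
	)
	return err
}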
Conclusion: From Risk to Repeatability
Deploying stateful services is inherently more complex than their stateless counterparts. By abandoning the monolithic deploy-and-migrate approach in favor of a multi-phase, automated strategy, we transform a high-risk, manual process into a repeatable, observable, and safe engineering practice.
The combination of Istio's precise traffic control, Flagger's progressive delivery automation, and Prometheus's deep, queryable observability provides the necessary toolkit. The key, however, is not the tools themselves, but the architectural pattern: instrumenting your application with state-aware metrics, designing for backwards compatibility, and using automated gates like webhooks to enforce preconditions. This approach allows development teams to move faster, not by cutting corners, but by building a sophisticated safety net that understands the intricate dance between code and data.