PostgreSQL Logical Replication for Zero-Downtime Multi-Region Failover
The Multi-Region Imperative: Beyond Physical Replication
For any system requiring high availability, a multi-region database architecture is non-negotiable. Traditional PostgreSQL physical (streaming) replication is a battle-tested solution for creating hot standbys, but it operates at the block level. This monolithic approach, while simple, imposes significant rigidity. A standby is a byte-for-byte copy of the primary, making tasks like major version upgrades, selective table replication, or planned, zero-downtime failovers operationally complex and fraught with risk.
Enter logical replication. By replicating data at the logical level—re-playing the INSERT, UPDATE, and DELETE operations—it decouples the replica from the primary's physical storage. This opens up a world of architectural flexibility. We can replicate between different major PostgreSQL versions, replicate only a subset of tables, and, most critically for this discussion, orchestrate a controlled, zero-downtime promotion of a replica to a primary.
This article is not a primer on setting up a publication and subscription. We assume you've read the official documentation. Instead, we will construct a production-ready playbook for a multi-region failover, addressing the complex edge cases and operational realities that separate a proof-of-concept from a resilient, production system.
Our scenario: A primary database in us-east-1 and a read-replica in eu-west-1. Our goal is to promote eu-west-1 to become the new primary with zero data loss and sub-minute application downtime, and then re-establish the old primary as a new replica.
Section 1: The Foundation - An Idempotent and Optimized Setup
A robust failover protocol is built upon a correct and performance-aware initial setup. Let's refine the standard process for a high-throughput, multi-region environment.
1.1. Primary (`us-east-1`) Configuration (`postgresql.conf`)
The defaults are insufficient for a production workload across a WAN.
# postgresql.conf on Primary (us-east-1)
# Must be 'logical' to decode WAL into logical changes.
wal_level = logical
# At least one for the subscription, plus others for backups, etc.
# A good starting point is 10.
max_wal_senders = 10
# Must be at least the number of subscriptions you plan to run.
# Also used by physical replication slots.
max_replication_slots = 10
# WAL retention: the logical replication slot (created in Section 1.3) already
# pins every WAL segment the subscriber has not yet confirmed, so a lagging
# replica cannot lose changes. The real risk is the opposite: unbounded WAL
# retention filling the primary's disk (see Section 3.2). On PostgreSQL 13+,
# cap it as a last resort -- if the cap is hit, the slot is invalidated and the
# replica must be re-seeded, so size it generously and alert well before then.
max_slot_wal_keep_size = 200GB # Adjust based on disk headroom and write volume.
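The replica needs its own settings too: it must run the apply and table-synchronization workers, and, because eu-west-1 will itself publish changes after a failover, it should be configured symmetrically with the primary. A minimal sketch (values are starting points, not tuned recommendations):
# postgresql.conf on Replica (eu-west-1)
wal_level = logical                   # required once eu-west-1 becomes the publisher after failover
max_logical_replication_workers = 8   # apply + tablesync workers
max_worker_processes = 16             # must be large enough to cover the workers above
max_replication_slots = 10
max_wal_senders = 10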
1.2. Publication: Granularity is Key
Avoid the temptation to use CREATE PUBLICATION my_pub FOR ALL TABLES;. It is a blunt instrument. A better practice is to manage table inclusion explicitly, which prevents accidental replication of transient or staging data and region-specific configuration tables.
-- On Primary (us-east-1)
CREATE PUBLICATION main_publication
FOR TABLE users, products, orders, order_items
WITH (publish = 'insert, update, delete');
This explicit approach makes DDL management (discussed later) more predictable.
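It also keeps later changes deliberate. For example, once the subscription from Section 1.3 exists, bringing a new table under replication (a hypothetical invoices table here) is an explicit two-step action:
-- On Primary (us-east-1)
ALTER PUBLICATION main_publication ADD TABLE invoices;
-- On Replica (eu-west-1): pick up the newly published table
-- (by default this also copies its existing rows).
ALTER SUBSCRIPTION main_subscription REFRESH PUBLICATION;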
1.3. Initial Seeding: The `copy_data = false` Pattern
The CREATE SUBSCRIPTION ... WITH (copy_data = true) option is convenient for small databases but a production hazard for large ones. It launches per-table COPY (tablesync) processes that hold long-running transactions open on the primary (stalling vacuum) and can saturate network bandwidth unpredictably.
The professional pattern involves a manual, more controlled data seeding process.
Step 1: Create a replication slot on the primary and export its snapshot. The SQL function pg_create_logical_replication_slot() does not hand you a snapshot, so create the slot over a replication-protocol connection instead: the command returns a snapshot_name that describes the database exactly at the slot's consistent point, and it stays valid for as long as that session remains open and idle.
# On a machine that can connect to the primary: open an interactive
# replication-protocol session and KEEP IT OPEN (and idle) until the data
# dump in Step 2 has finished -- the exported snapshot dies with the session.
psql 'postgres://user:[email protected]/mydb?replication=database'
-- Then, inside that psql session:
CREATE_REPLICATION_SLOT sub_eu_west_1 LOGICAL pgoutput EXPORT_SNAPSHOT;
-- Note the snapshot_name column in the result (e.g. 00000003-00000002-1).
Step 2: Take a consistent dump of the primary using that snapshot. pg_dump with --snapshot gives a transactionally consistent export without blocking writers, and because the snapshot came from the slot creation itself, the dumped data lines up exactly with the LSN the slot will stream from: nothing is missed and nothing is applied twice.
# In a separate terminal, while the Step 1 session is still open.
pg_dump 'postgres://user:[email protected]/mydb' \
--schema-only -f schema.sql
# Substitute the snapshot_name returned in Step 1.
pg_dump 'postgres://user:[email protected]/mydb' \
--data-only --snapshot='00000003-00000002-1' \
-f data.sql
# Once both dumps are complete, the Step 1 session can be closed.
Step 3: Restore to the replica.
# On a machine that can connect to the replica
psql 'postgres://user:[email protected]/mydb' -f schema.sql
psql 'postgres://user:[email protected]/mydb' -f data.sql
Step 4: Create the subscription. Now we create the subscription, telling it not to copy data but to start streaming changes from the slot we created, which corresponds to the point in time of our snapshot.
-- On Replica (eu-west-1)
CREATE SUBSCRIPTION main_subscription
CONNECTION 'postgres://user:[email protected]/mydb'
PUBLICATION main_publication
WITH (
copy_data = false,
slot_name = 'sub_eu_west_1',
create_slot = false
);
This method minimizes production impact on the primary and provides a predictable, resumable process for initializing large replicas.
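Before moving on, it is worth verifying that the subscription has attached to the slot and is streaming; a quick check using the names from this section:
-- On Replica (eu-west-1): one row per subscription; received_lsn should advance.
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;
-- On Primary (us-east-1): the slot should now be active.
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub_eu_west_1';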
Section 2: The Zero-Downtime Failover Protocol - A Step-by-Step Orchestration
This is the core of the operation. It must be scripted and automated. A manual failover under pressure is a recipe for disaster.
Pre-computation: The Sequence Synchronization Problem
A critical oversight in many logical replication guides is sequence synchronization. Sequences are not part of the INSERT/UPDATE/DELETE stream and their state is not replicated. If you fail over, your new primary's sequences will be at their initial START WITH value, leading to immediate primary key constraint violations.
We must capture their state on the old primary after writes have stopped and apply it to the new primary before enabling writes.
Here is a script to generate the necessary setval commands:
-- Filename: generate_sequence_sync.sql
-- Run this on the OLD primary (us-east-1) AFTER writes are quiesced.
SELECT
'SELECT setval(' ||
quote_literal(quote_ident(schemaname) || '.' || quote_ident(sequencename)) ||
', ' || last_value || ', true);' AS cmd
FROM pg_sequences
WHERE last_value IS NOT NULL; -- sequences that have never been used have no state to copy
This will output a series of SQL commands that we'll execute on the new primary during the failover.
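The generated commands look like this (sequence names and values are purely illustrative):
SELECT setval('public.users_id_seq', 8272913, true);
SELECT setval('public.orders_id_seq', 1920034, true);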
The Failover Script
What follows is a conceptual shell script outlining the orchestration. In a real environment, this would be implemented in a tool like Ansible, Terraform, or a custom Python script.
#!/bin/bash
set -euo pipefail
# --- Configuration ---
OLD_PRIMARY_CONN="postgres://user:[email protected]/mydb"
NEW_PRIMARY_CONN="postgres://user:[email protected]/mydb"
APP_LOAD_BALANCER="lb-prod-abcde12345"
# --- Functions ---
check_replication_lag() {
# Returns the number of WAL bytes the subscriber has not yet replayed.
# Must be 0 before the failover can proceed. Prints -1 if the
# subscriber's walsender is not currently connected.
psql "$OLD_PRIMARY_CONN" -Atc "
SELECT COALESCE(
(SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
FROM pg_stat_replication
WHERE application_name = 'main_subscription'),
-1);
"
}
# --- Orchestration ---
# Step 1: Quiesce Application Writes
# This is the most application-specific step. It could involve:
# - An API call to put the application into maintenance mode.
# - Changing load balancer rules to send traffic to static error pages.
# - Scaling down application writer pods in Kubernetes.
echo "[FAILOVER] Step 1: Putting application into maintenance mode..."
# aws elbv2 modify-listener ... or similar command
sleep 10 # Allow in-flight requests to complete
# Step 2: Verify Final Replication Sync
# After writes stop, we must ensure the last few transactions have replicated.
echo "[FAILOVER] Step 2: Verifying final replication sync..."
while true; do
LAG=$(check_replication_lag)
echo "Current replay lag is: $LAG bytes"
if [ "$LAG" -eq 0 ]; then
echo "Replica is fully synced."
break
fi
sleep 2
done
# Step 3: Capture Sequence State from Old Primary
echo "[FAILOVER] Step 3: Capturing sequence state..."
psql "$OLD_PRIMARY_CONN" -f generate_sequence_sync.sql > set_sequences.sql
# Step 4: Promote the Replica
echo "[FAILOVER] Step 4: Promoting replica in eu-west-1..."
# Disable the subscription so it doesn't try to connect to the old primary.
psql "$NEW_PRIMARY_CONN" -c "ALTER SUBSCRIPTION main_subscription DISABLE;"
# Detach the remote slot, then drop the subscription. We'll create a new one for failback later.
psql "$NEW_PRIMARY_CONN" -c "ALTER SUBSCRIPTION main_subscription SET (slot_name = NONE);"
psql "$NEW_PRIMARY_CONN" -c "DROP SUBSCRIPTION main_subscription;"
# At this point, the replica is a standalone, writable database.
# While writes are still quiesced, prepare the failback path: create the
# publication and slot on eu-west-1 NOW, so that no write made after the
# traffic switch in Step 6 can be missed by the future failback subscription.
psql "$NEW_PRIMARY_CONN" -c "CREATE PUBLICATION failback_publication FOR TABLE users, products, orders, order_items;"
psql "$NEW_PRIMARY_CONN" -c "SELECT pg_create_logical_replication_slot('sub_us_east_1', 'pgoutput');"
# Step 5: Synchronize Sequences on New Primary
echo "[FAILOVER] Step 5: Applying sequence state to new primary..."
psql "$NEW_PRIMARY_CONN" -f set_sequences.sql
# Step 6: Redirect Application Traffic
echo "[FAILOVER] Step 6: Redirecting application traffic to new primary..."
# Update DNS, load balancer target group, or application configuration
# to point to the NEW_PRIMARY_CONN.
# aws elbv2 modify-target-group ...
# --- DOWNTIME ENDS HERE ---
# The application is now live, writing to the new primary in eu-west-1.
echo "[FAILOVER] COMPLETE. Application is now live on eu-west-1."
# Step 7: Reconfigure Old Primary as a New Replica (Failback Preparation)
echo "[FAILOVER] Step 7: Reconfiguring old primary as a replica..."
# On the OLD primary (us-east-1), drop the old publication and the now-orphaned
# replication slot we detached in Step 4 -- left in place, it would retain WAL
# forever (see Section 3.2 on slot bloat).
psql "$OLD_PRIMARY_CONN" -c "DROP PUBLICATION main_publication;"
psql "$OLD_PRIMARY_CONN" -c "SELECT pg_drop_replication_slot('sub_eu_west_1');"
# The old primary's data already matches the new primary as of the failover point,
# so subscribe with copy_data = false against the slot created in Step 4.
psql "$OLD_PRIMARY_CONN" -c "
CREATE SUBSCRIPTION sub_to_eu_west_1
CONNECTION '$NEW_PRIMARY_CONN'
PUBLICATION failback_publication
WITH (copy_data = false, slot_name = 'sub_us_east_1', create_slot = false);"
echo "[FAILOVER] Failback path established. us-east-1 is now a replica of eu-west-1."
This script highlights the necessary steps: quiesce, verify, promote, sync sequences, and redirect. The downtime is precisely the duration between Step 1 and Step 6, which can be under 30 seconds with proper automation.
Section 3: Advanced Patterns and Production Nightmares
3.1. DDL Management: The Achilles' Heel
Logical replication does not propagate DDL changes (ALTER TABLE, CREATE INDEX, etc.). This is a major operational hurdle. A disciplined, manual process is required.
The Safe DDL Deployment Protocol:
Step 1: Apply the DDL (ALTER TABLE, CREATE INDEX, etc.) on the replica (eu-west-1) first. This is safe because the replica takes no application writes. For additive, non-breaking changes (like adding a nullable column), this is straightforward; for breaking changes (dropping a column), it requires careful application-level coordination.
Step 2: Apply the same DDL on the primary (us-east-1).
No publication-level "refresh" is needed for this case: if the table is already part of the publication, data for the new column starts flowing automatically once the column exists on both sides. A refresh is only required when a table is added to (or removed from) the publication, and it runs on the subscriber, not the primary:
-- On Replica (eu-west-1), only after ALTER PUBLICATION ... ADD TABLE on the primary
ALTER SUBSCRIPTION main_subscription REFRESH PUBLICATION;
Failure to follow this order can break replication. For example, if the primary starts replicating data for a new column that doesn't yet exist on the subscriber, the subscriber's apply worker will fail with an error, and replication will halt until the DDL is manually applied.
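As a concrete illustration of the safe ordering for an additive change (the column name is hypothetical):
-- 1. On Replica (eu-west-1) first
ALTER TABLE orders ADD COLUMN discount_cents integer;
-- 2. Then on Primary (us-east-1)
ALTER TABLE orders ADD COLUMN discount_cents integer;
-- No publication or subscription change is needed; data for the new column flows automatically.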
3.2. Monitoring: Beyond Basic Lag
Replication lag is the most obvious metric, but it's not the only one. A silent killer is replication slot bloat.
If the replica (eu-west-1) disconnects from the primary for an extended period (e.g., due to a network partition), the replication slot on the primary will prevent the deletion of WAL files, as it believes they are still needed by the subscriber. This will cause the primary's disk to fill up, eventually crashing the entire database.
Production-Grade Monitoring Query (for Prometheus/Datadog):
-- Run on the PRIMARY database
SELECT
s.slot_name,
s.active,
-- The amount of WAL data the primary is retaining for this slot.
-- A continuously growing value here is a major red alert.
pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) AS slot_wal_retained_bytes,
-- Replay lag as reported by the walsender currently serving this slot.
-- NULL if the subscriber is disconnected. (Join on active_pid rather than name:
-- in our setup the slot name and the subscription's application_name differ.)
(SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), r.replay_lsn)
FROM pg_stat_replication r
WHERE r.pid = s.active_pid) AS replay_lag_bytes
FROM pg_replication_slots s;
You MUST have an alert on slot_wal_retained_bytes that triggers if it exceeds a threshold (e.g., 10GB). This gives you time to investigate the disconnected subscriber before it takes down your primary.
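Until that alert is wired into Prometheus or Datadog, even a plain threshold query run from cron provides most of the protection; a minimal sketch using the 10GB example threshold above:
-- Run on the PRIMARY database; any row returned should page someone.
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240; -- 10 GB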
3.3. Conflict Resolution
Logical replication has no built-in conflict resolution. If, by some error, a local write on the replica collides with an incoming change (the classic case is a local INSERT that takes a primary key the publisher later inserts), the apply worker errors out, retries in a loop, and replication halts until the offending row is removed or, on PostgreSQL 15+, the transaction is skipped with ALTER SUBSCRIPTION ... SKIP. Conflicts that do not violate a constraint are handled silently instead: an incoming UPDATE overwrites local modifications, and UPDATEs or DELETEs for rows that are missing locally are simply skipped.
The only robust production strategy is therefore to enforce read-only access on the replica at all times. This can be done via permissions:
-- On Replica (eu-west-1)
-- Strip write access on all existing tables from the application role.
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM application_user;
-- And undo any default GRANT so tables created later (by the schema-owning
-- role, e.g. a migration_owner role) are not writable by the app either.
ALTER DEFAULT PRIVILEGES FOR ROLE migration_owner IN SCHEMA public
REVOKE INSERT, UPDATE, DELETE ON TABLES FROM application_user;
Your failover protocol should include steps to grant these privileges on the newly promoted primary and revoke them on the newly demoted replica.
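A minimal sketch of both sides of that switch, using the same application_user role as above (these statements would slot in around Steps 5 and 7 of the failover script):
-- On the newly promoted primary (eu-west-1), before traffic is redirected
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO application_user;
-- On the newly demoted replica (us-east-1), once it is subscribed for failback
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM application_user;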
Section 4: Performance Tuning for High-Latency Environments
Replicating data across continents introduces latency. Here's how to optimize for it.
4.1. Subscriber-Side Parallelism
By default, a single apply worker processes all incoming changes serially. For a high-throughput primary, this can become a bottleneck. PostgreSQL 14 added the streaming subscription option, which sends large in-progress transactions to the subscriber as they happen instead of spilling them on the publisher until commit; PostgreSQL 16 extends this with streaming = parallel, where dedicated parallel apply workers apply those streamed transactions as they arrive.
-- On Replica (eu-west-1), PostgreSQL 16+
ALTER SUBSCRIPTION main_subscription SET (streaming = parallel);
When to use this: it pays off when the workload contains large transactions (bulk loads, batch updates) that would otherwise be applied in one long serial burst at commit time. It does not parallelize a stream of many small transactions, and changes to a single hot orders table still apply in commit order, so benchmark this setting against your specific workload.
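The worker count itself is a subscriber-side server setting rather than a subscription option; a sketch of the relevant knob (the value is an illustrative starting point):
# postgresql.conf on Replica (eu-west-1), PostgreSQL 16+
# Parallel apply workers are drawn from the subscriber's max_logical_replication_workers pool.
max_parallel_apply_workers_per_subscription = 4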
4.2. Primary-Side Commit Tuning
On the primary, you can batch commits to reduce the number of WAL flushes, which can improve throughput at the cost of slightly higher latency for any single transaction.
# postgresql.conf on Primary (us-east-1)
# Wait up to 500 microseconds for other transactions to commit
# before flushing WAL. Can group multiple commits into one I/O operation.
commit_delay = 500
# How many concurrent transactions must be active for commit_delay to kick in.
# Prevents adding latency for low-traffic periods.
commit_siblings = 5
This is a trade-off. It can increase overall replication throughput by sending larger, more efficient chunks of WAL, but it adds latency to individual commits on the primary. This is often a worthwhile trade-off for very high-volume insert/update workloads.
Conclusion: A Tool of Power and Precision
Logical replication is not a simple drop-in for physical replication. It's a more powerful, more flexible tool that demands a higher level of operational discipline. A successful multi-region, zero-downtime architecture using logical replication depends less on the initial CREATE SUBSCRIPTION command and more on the robust, scripted, and well-monitored processes built around it.
By scripting the failover protocol, meticulously managing DDL, synchronizing database sequences, and monitoring for silent killers like replication slot bloat, you can build a truly resilient system. The complexity is front-loaded into the engineering and automation, but the payoff is an architecture that can gracefully handle both planned maintenance and unforeseen disasters with the precision required by modern, always-on applications.