Global Low-Latency Reads with CockroachDB's Geo-Partitioning
The Inescapable Problem: Physics vs. Global Consistency
As a senior engineer architecting a global application, you've inevitably faced the trade-off between consistency and latency. In a distributed SQL system like CockroachDB, which prioritizes serializable isolation, every read must consult the Raft group's leaseholder for the relevant data range. While a multi-region cluster provides high availability by replicating data across geographic locations, it doesn't inherently guarantee low-latency reads. A user in Sydney, Australia (ap-southeast-2) reading a row whose leaseholder resides in Ashburn, Virginia (us-east-1) will always pay the ~200ms round-trip time (RTT) penalty. Physics is a non-negotiable constraint.
Standard follower reads can reduce this latency but introduce data staleness, which is often unacceptable for critical user data like identity, entitlements, or account information. The core challenge remains: how do you deliver single-digit millisecond read latency to a global user base without compromising on strong consistency?
This is where geo-partitioning goes from a feature you skim in the documentation to a critical architectural pattern. By partitioning a table on a geographic locator (such as a region column) and then pinning the Raft leaseholders for each partition to nodes within that specific geography, we can ensure that a user's reads are always served by a server in their local region. This article provides a deep, implementation-focused dive into this pattern, its significant performance trade-offs, and the complex edge cases you'll encounter in a production environment.
Foundational Concepts: Leaseholders and the Latency Tax
Before we manipulate them, let's be precise about the role of leaseholders. In CockroachDB's Raft implementation, each range of data (512 MiB by default in recent versions, 64 MiB in older releases) has a leaseholder node. This node is responsible for coordinating all read and write operations for its range. While writes must achieve a quorum across the Raft group (often spanning multiple regions), consistent reads (SELECT queries at the default timestamp) can be served directly by the leaseholder without network hops to other replicas, provided it can confirm its lease is valid.
This is the key to performance. If the leaseholder is geographically close to the application server making the request, the read is fast. If it's on the other side of the world, the read is slow.
Scenario: The Unoptimized Global Read
Imagine a 3-region cluster (us-east-1, eu-west-1, ap-southeast-2) with a simple users table. CockroachDB's default behavior will attempt to balance leaseholders across the cluster to distribute the workload. A user in Sydney might get lucky and hit a local leaseholder, or they might not.
Let's visualize the unlucky path:
- An application server in ap-southeast-2 issues SELECT * FROM users WHERE id = 'sydney-user-id';
- The gateway node handling the query is in ap-southeast-2.
- The leaseholder for the range containing 'sydney-user-id' is currently in us-east-1.
- The gateway forwards the request to the us-east-1 leaseholder.
- The us-east-1 node serves the read from its local state.
- The result travels back from us-east-1 to ap-southeast-2.
This round trip easily adds 200-250ms of latency to the query. For user-facing interactions, this is often unacceptable.
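You can inspect leaseholder placement directly from SQL. A minimal sketch, assuming the users table from this scenario (exact output columns vary by version; on v23.1+ add WITH DETAILS to include leaseholder information):
-- Show each range of the table and, depending on version/options, which node holds its lease.
SHOW RANGES FROM TABLE users;
-- Check the lease_holder / lease_holder_locality columns to see which region serves reads for each range.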
Implementing Geo-Partitioning: A Production Pattern
Our goal is to explicitly tell CockroachDB where data and its corresponding leaseholders should live. We will use a users table for a global application as our primary example.
Step 1: Multi-Region Database Setup
First, we establish our database's geographic footprint. This is done at the database level and informs CockroachDB about our desired data locality and survival goals.
-- Connect to your CockroachDB cluster
-- This command defines the regions the database will operate in.
-- The PRIMARY REGION is where non-regionalized data and metadata will live.
CREATE DATABASE global_app
PRIMARY REGION "us-east-1"
REGIONS "eu-west-1", "ap-southeast-2"
SURVIVE REGION FAILURE;
-- Switch to our new database
USE global_app;
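A quick sanity check that the regions registered as expected:
-- Lists the regions the database spans and flags the primary region.
SHOW REGIONS FROM DATABASE global_app;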
Breaking down the CREATE DATABASE statement:
- PRIMARY REGION: Sets the default region. Non-partitioned tables and system data will have their leaseholders prefer this region.
- REGIONS: Adds the other participating regions.
- SURVIVE REGION FAILURE: This is critical. It tells CockroachDB to configure replication so the database can survive the failure of an entire region.
Step 2: Designing the Partitioned Table
The key to this pattern is including a column in your table that explicitly states the data's home region. We'll use an ENUM for type safety.
-- Create a type for our regions to enforce consistency.
CREATE TYPE region_enum AS ENUM ('us-east-1', 'eu-west-1', 'ap-southeast-2');
-- Create the users table with the critical 'region' column.
CREATE TABLE users (
id UUID NOT NULL DEFAULT gen_random_uuid(),
region region_enum NOT NULL,
email STRING NOT NULL,
profile JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
-- The partitioning column must be a prefix of the primary key, so region leads the key.
PRIMARY KEY (region, id),
UNIQUE INDEX users_email_idx (email)
);
Every time a user is created, the application logic must determine their home region (e.g., via GeoIP lookup during signup, or user selection) and populate the region column. This column is the pivot for our entire strategy.
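If you want a database-side fallback when the application does not supply a region, one hedged option is to default the column to the region of the gateway node the connection arrived through. This assumes your CockroachDB version permits gateway_region() in a column default (the built-in multi-region machinery uses the same trick), so verify it on your version first:
-- Hypothetical convenience: home new users in the region of the node they connected through.
-- The application can (and usually should) still set the column explicitly.
ALTER TABLE users ALTER COLUMN region SET DEFAULT gateway_region()::region_enum;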
Step 3: Creating the Partitions
Now we alter the table to create a list partition based on the region column. Each value in our region_enum will correspond to a separate partition.
-- This DDL command splits the table's data into logical partitions.
-- At this point, the data is still replicated globally.
ALTER TABLE users PARTITION BY LIST (region) (
PARTITION us_east_1 VALUES IN ('us-east-1'),
PARTITION eu_west_1 VALUES IN ('eu-west-1'),
PARTITION ap_southeast_2 VALUES IN ('ap-southeast-2')
);
After this step, CockroachDB knows that rows with region = 'us-east-1' belong to the us_east_1 partition, but it hasn't yet changed where the data or leaseholders for that partition live. The data for all partitions is still replicated across all three regions to satisfy the SURVIVE REGION FAILURE setting.
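You can confirm the partitions exist (and, later, see the zone configurations attached to them) with a single statement:
-- Lists each partition of the table, its parent index, its values, and any zone config.
SHOW PARTITIONS FROM TABLE users;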
Step 4: Pinning Leaseholders with Zone Configurations
This is the most critical step. We will apply a zone configuration to each partition, constraining its replicas and, most importantly, its leaseholder, to the desired region.
-- Note: if you created the database with the multi-region options above, recent versions
-- may require this session setting before manually editing zone configs:
-- SET override_multi_region_zone_config = true;
-- For US users, pin their data and leaseholders to us-east-1.
ALTER PARTITION us_east_1 OF TABLE users
CONFIGURE ZONE USING
num_replicas = 3,
constraints = '[+region=us-east-1]',
lease_preferences = '[[+region=us-east-1]]';
-- For European users, pin their data and leaseholders to eu-west-1.
ALTER PARTITION eu_west_1 OF TABLE users
CONFIGURE ZONE USING
num_replicas = 3,
constraints = '[+region=eu-west-1]',
lease_preferences = '[[+region=eu-west-1]]';
-- For Asia-Pacific users, pin their data and leaseholders to ap-southeast-2.
ALTER PARTITION ap_southeast_2 OF TABLE users
CONFIGURE ZONE USING
num_replicas = 3,
constraints = '[+region=ap-southeast-2]',
lease_preferences = '[[+region=ap-southeast-2]]';
Let's break down this command:
- ALTER PARTITION ... CONFIGURE ZONE: We are applying a rule to a specific partition.
- num_replicas = 3: We maintain 3 replicas for HA within the region.
- constraints = '[+region=...]': This is a hard requirement. It forces all 3 replicas (voters and non-voters) for this partition's data to live on nodes in the specified region. This localizes the data entirely.
- lease_preferences = '[[+region=...]]': This is an explicit instruction to the leaseholder rebalancing system. It tells CockroachDB to strongly prefer placing the leaseholder for this partition on a node in the specified region.
With this configuration, the data for Australian users not only lives exclusively on nodes in ap-southeast-2, but the Raft leaseholder for that data is also all but guaranteed to be there. The latency tax is eliminated.
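Once the configuration propagates and replication catches up (this can take a little while on a loaded cluster), you can verify both the zone configs and the actual lease placement; a sketch, with version-dependent syntax and output noted in the comments:
-- Show the zone config now attached to a partition (older versions use FOR in place of FROM).
SHOW ZONE CONFIGURATION FROM PARTITION ap_southeast_2 OF TABLE users;
-- Re-check lease placement; lease_holder_locality should now match each partition's home region.
SHOW RANGES FROM TABLE users;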
Verification and Performance Comparison
Let's prove the difference. Assume we have an application instance running in Sydney (ap-southeast-2).
Before Geo-Partitioning (Leaseholder in us-east-1):
-- Run from a client connected to a Sydney node
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 'sydney-user-id';
info
------------------------------------------------------------------------------------------------
planning time: 234µs
execution time: 215ms
distribution: full
vectorized: false
rows read from KV: 1 (84 B)
cumulative time in KV: 214ms
KV round-trips: 1
... (output truncated)
The key metric is execution time: 215ms. This is the speed of light tax in action.
After Geo-Partitioning:
First, insert a user homed in Sydney:
INSERT INTO users (region, email, profile) VALUES ('ap-southeast-2', '[email protected]', '{}');
-- Let's say this returns id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d'
Now, run the query from the same Sydney client. Because region now leads the primary key, include it in the predicate so the scan is constrained to the local partition:
EXPLAIN ANALYZE SELECT * FROM users WHERE region = 'ap-southeast-2' AND id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';
info
------------------------------------------------------------------------------------------------
planning time: 180µs
execution time: 3ms
distribution: local
vectorized: false
rows read from KV: 1 (84 B)
cumulative time in KV: 2ms
KV round-trips: 1
... (output truncated)
The difference is staggering: execution time: 3ms. The distribution: local line and the 2ms of cumulative KV time confirm the query was planned and served entirely within the Sydney region, with no cross-ocean hop. We have achieved our goal of low-latency local reads.
Advanced Scenarios and Production Edge Cases
Geo-partitioning is not a silver bullet. It introduces significant complexity and trade-offs that senior engineers must architect around.
Edge Case 1: The Write Latency Penalty
We optimized for reads, but what about writes? When we constrained each partition's replicas to a single region, we broke our database's region-level survival goal for this table. A write for a Sydney user (INSERT, UPDATE, DELETE) now only needs to achieve Raft quorum among the nodes in ap-southeast-2. This makes writes fast locally but eliminates regional failure tolerance for that data.
To restore regional fault tolerance, we must modify our zone configs to store replicas in other regions.
-- Let's reconfigure the ap_southeast_2 partition for regional HA
ALTER PARTITION ap_southeast_2 OF TABLE users
CONFIGURE ZONE USING
num_replicas = 3,
-- Place one replica in each region
constraints = '{ "+region=ap-southeast-2": 1, "+region=us-east-1": 1, "+region=eu-west-1": 1 }',
-- But STRONGLY prefer the leaseholder stays in APAC
lease_preferences = '[[+region=ap-southeast-2]]';
-- Repeat for other partitions...
Now we have the best of both worlds: data is globally replicated for HA, and reads are served locally and quickly. But we have re-introduced a massive write penalty.
A write for a Sydney user must now follow this path:
- The write is routed to the ap-southeast-2 leaseholder.
- The leaseholder appends the write to its Raft log and replicates it to the follower replicas in us-east-1 and eu-west-1.
- It must wait for at least one of those followers to acknowledge the entry to achieve quorum (2 out of 3 replicas).
- This process will take at least the RTT to the nearer of the other two regions.
An INSERT that previously took 10ms might now take 150-250ms. This is the fundamental trade-off of this pattern.
Solutions:
- Accept the penalty: this pattern deliberately optimizes the read path, and for read-heavy data like user profiles a slower write is often an acceptable trade.
- Keep quorum local: for data that does not need to survive a region outage, retain the original single-region configuration (fast writes, in-region HA only).
- A middle ground using non-voting replicas is sketched below.
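The following is a hedged illustration rather than part of the original configuration: it keeps all three voting replicas, and therefore the write quorum, inside ap-southeast-2 while adding non-voting replicas in the other regions. Non-voting replicas can serve stale follower reads remotely, but they do not make writes survive a region failure:
-- Sketch: fast in-region write quorum plus non-voting replicas abroad (assumes v21.1+ zone config variables).
ALTER PARTITION ap_southeast_2 OF TABLE users
CONFIGURE ZONE USING
num_replicas = 5,
num_voters = 3,
voter_constraints = '[+region=ap-southeast-2]',
constraints = '{"+region=ap-southeast-2": 3, "+region=us-east-1": 1, "+region=eu-west-1": 1}',
lease_preferences = '[[+region=ap-southeast-2]]';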
Edge Case 2: Cross-Region Queries
What happens when a user in Europe needs to view the profile of a user in Australia?
-- Run from a client connected to an eu-west-1 node
SELECT * FROM users WHERE email = '[email protected]';
CockroachDB's distributed SQL optimizer will handle this, but it will be slow. The eu-west-1 gateway node will:
- Determine that the target row lives in the ap_southeast_2 partition.
- Identify the leaseholder for that data, which is pinned to a node in ap-southeast-2.
- Make a remote procedure call (RPC) to that node to fetch the data.
The query will incur the full eu-west-1 <-> ap-southeast-2 RTT, resulting in ~300ms latency.
Solutions:
- Route at the application layer: a service instance in eu-west-1 can make an API call to a service instance in ap-southeast-2. This keeps the database query local to the data and moves the latency to the application network, which can be managed with timeouts, retries, and circuit breakers.
- Maintain a globally readable directory table: duplicate the small, frequently viewed subset of profile data into a non-partitioned table that every region can read cheaply:
CREATE TABLE user_global_directory (
id UUID PRIMARY KEY,
region region_enum NOT NULL,
display_name STRING,
avatar_url STRING
);
-- This table is NOT partitioned. Writes are slow, but follower reads are fast.
ALTER TABLE user_global_directory CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{ "+region=ap-southeast-2": 1, "+region=us-east-1": 1, "+region=eu-west-1": 1 }';
-- 'Stale but fast' reads don't need a special zone configuration: they are requested
-- per query with a follower-read timestamp (example below). Alternatively, in a
-- multi-region database the table can be declared GLOBAL
-- (ALTER TABLE user_global_directory SET LOCALITY GLOBAL) for consistent,
-- low-latency reads in every region at the cost of slower writes.
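A read of the directory from any region then trades a few seconds of staleness for local latency; a minimal sketch using the id generated earlier:
-- Served by a nearby replica at a slightly historical timestamp.
SELECT display_name, avatar_url
FROM user_global_directory AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';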
Edge Case 3: Moving a User's Data
A user moves from London to Sydney. Their home region needs to change to ensure they continue to get low-latency reads. This requires an UPDATE statement.
UPDATE users SET region = 'ap-southeast-2' WHERE id = 'london-user-id';
This simple query triggers a complex background process in CockroachDB:
- The row is updated, changing the value of the partitioning column.
- The row no longer belongs in the eu_west_1 partition.
- CockroachDB moves the row's data from the range(s) backing the eu_west_1 partition to the correct range(s) in the ap_southeast_2 partition.
- This involves a distributed delete from the old partition and a distributed insert into the new one, which can be resource-intensive.
This process is atomic and consistent, but it's not instantaneous. You can monitor the progress of data rebalancing through the CockroachDB DB Console's data distribution maps. For applications with high user mobility, you must account for the I/O and network overhead of these partition moves.
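To confirm where a specific row now lives, you can ask for its range directly. A sketch, using the placeholder id from above and the composite primary key (output columns vary by version):
SHOW RANGE FROM TABLE users FOR ROW ('ap-southeast-2', 'london-user-id');
-- lease_holder_locality should now report region=ap-southeast-2.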
Edge Case 4: JOINs with Global Tables
Consider a scenario where you want to join your geo-partitioned users table with a non-partitioned, globally replicated products table.
-- products is not partitioned; its leaseholders likely sit in the primary region (us-east-1).
-- purchases (schema not shown) links users to products; assume it is geo-partitioned like users.
SELECT u.email, p.product_name
FROM users u
JOIN purchases pur ON u.id = pur.user_id
JOIN products p ON pur.product_id = p.id
WHERE u.id = 'sydney-user-id';
CockroachDB's cost-based optimizer is sophisticated, but this query can be a performance minefield. The optimizer has to choose a plan:
- A lookup join: fetch the user row from the local ap-southeast-2 partition, then, for each associated purchase, send a request to wherever the products data lives (likely us-east-1). This is a high-latency N+1 query at the database level.
- A hash or merge join: ship the relevant products data to the ap-southeast-2 node and perform the join locally. This can be slow if the products table is large.
Solution:
- EXPLAIN plans: Always run EXPLAIN (or EXPLAIN ANALYZE) on these types of queries and understand the join strategy the optimizer is choosing, as sketched below.
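A quick way to check which strategy you are getting; a sketch that reuses the query above with the region predicate added so the users scan stays local (plan output varies by version):
EXPLAIN SELECT u.email, p.product_name
FROM users u
JOIN purchases pur ON u.id = pur.user_id
JOIN products p ON pur.product_id = p.id
WHERE u.region = 'ap-southeast-2' AND u.id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';
-- Look for lookup join vs hash join operators in the plan, and use EXPLAIN ANALYZE's
-- KV timing to spot reads served by ranges outside ap-southeast-2.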
Conclusion: A Powerful Tool for a Specific Problem
Geo-partitioning in CockroachDB is an expert-level feature that directly addresses the challenge of providing low-latency reads in a consistent, globally distributed system. By co-locating data partitions and their Raft leaseholders with your users, you can eliminate cross-continent network hops for your most critical read paths, reducing latency from hundreds of milliseconds to single digits.
However, this power comes with significant architectural trade-offs:
- Write latency: restoring region-level survivability re-introduces cross-region quorum writes.
- Cross-region access: queries for data homed in another region pay the full inter-region RTT.
- Data movement: changing a row's home region triggers a delete-and-insert across partitions.
- Operational complexity: managing partitions, zone configurations, and cross-region JOINs adds to the system's operational burden.
This pattern is not for every table. It is best suited for read-heavy data that is strongly affiliated with a specific geography, such as user profiles, regional settings, or content tied to a specific country. For truly global, write-heavy data, other patterns may be more appropriate. By understanding the deep implementation details and carefully navigating the edge cases, you can wield geo-partitioning to build truly high-performance, resilient, and globally scalable applications.