Global Low-Latency Reads with CockroachDB's Geo-Partitioning

Goh Ling Yong

The Inescapable Problem: Physics vs. Global Consistency

As a senior engineer architecting a global application, you've inevitably faced the trade-off between consistency and latency. In a distributed SQL system like CockroachDB, which prioritizes serializable isolation, every read must consult the Raft group's leaseholder for the relevant data range. While a multi-region cluster provides high availability by replicating data across geographic locations, it doesn't inherently guarantee low-latency reads. A user in Sydney, Australia (ap-southeast-2) reading a row whose leaseholder resides in Ashburn, Virginia (us-east-1) will always pay the ~200ms round-trip time (RTT) penalty. Physics is a non-negotiable constraint.

Standard follower reads can reduce this latency but introduce data staleness, which is often unacceptable for critical user data like identity, entitlements, or account information. The core challenge remains: how do you deliver single-digit millisecond read latency to a global user base without compromising on strong consistency?
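
For contrast, here is roughly what opting into a follower read looks like against the users table used throughout this article. The follower_read_timestamp() built-in selects a timestamp far enough in the past that a nearby replica can serve the query, which is precisely the staleness that makes it unsuitable for identity or entitlement data:

sql
-- A follower read: served by the nearest replica, but the result may
-- lag the latest committed writes by several seconds.
SELECT * FROM users
AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 'sydney-user-id';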

This is where geo-partitioning goes from being a documentation feature to a critical architectural pattern. By partitioning a table by a geographic locator (like region) and then pinning the Raft leaseholders for each partition to nodes within that specific geography, we can ensure that a user's reads are always served by a server in their local region. This article provides a deep, implementation-focused dive into this pattern, its significant performance trade-offs, and the complex edge cases you'll encounter in production.


Foundational Concepts: Leaseholders and the Latency Tax

Before we manipulate them, let's be precise about the role of leaseholders. In CockroachDB's Raft implementation, each range of data (512 MiB by default in recent versions) has a leaseholder node. This node is responsible for coordinating all read and write operations for its range. While writes must achieve a quorum across the Raft group (often spanning multiple regions), consistent reads (SELECT queries at the default timestamp) can be served directly by the leaseholder without network hops to other replicas, provided it can confirm its lease is valid.

This is the key to performance. If the leaseholder is geographically close to the application server making the request, the read is fast. If it's on the other side of the world, the read is slow.

Scenario: The Unoptimized Global Read

Imagine a 3-region cluster (us-east-1, eu-west-1, ap-southeast-2) with a simple users table. CockroachDB's default behavior will attempt to balance leaseholders across the cluster to distribute the workload. A user in Sydney might get lucky and hit a local leaseholder, or they might not.

Let's visualize the unlucky path:

  • Client Request (Sydney): An application server in ap-southeast-2 issues SELECT * FROM users WHERE id = 'sydney-user-id';
  • Gateway Node (Sydney): The connection hits a local CockroachDB node in ap-southeast-2.
  • Leaseholder Lookup: This gateway node determines that the leaseholder for the data range containing 'sydney-user-id' is currently in us-east-1.
  • Cross-Continent Hop: The Sydney node forwards the read request to the us-east-1 leaseholder.
  • Leaseholder Serves Read: The us-east-1 node serves the read from its local state.
  • Cross-Continent Hop (Return): The result is sent back from us-east-1 to ap-southeast-2.
  • Client Response (Sydney): The gateway node returns the result to the application.
This round trip easily adds 200-250ms of latency to the query. For user-facing interactions, this is often unacceptable.
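
If you want to see this routing for yourself, you can inspect where a table's ranges and leaseholders currently live. The statement below is standard CockroachDB SQL, though the output columns differ between versions; on newer versions you may need to append WITH DETAILS to include the leaseholder and its locality:

sql
-- Show the ranges backing the users table; depending on version, the
-- leaseholder and its locality appear directly or via WITH DETAILS.
SHOW RANGES FROM TABLE users;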

    Implementing Geo-Partitioning: A Production Pattern

    Our goal is to explicitly tell CockroachDB where data and its corresponding leaseholders should live. We will use a users table for a global application as our primary example.

    Step 1: Multi-Region Database Setup

    First, we establish our database's geographic footprint. This is done at the database level and informs CockroachDB about our desired data locality and survival goals.

sql
-- Connect to your CockroachDB cluster.
-- This command defines the regions the database will operate in.
-- The PRIMARY REGION is where non-regionalized data and metadata will live.
CREATE DATABASE global_app
    PRIMARY REGION "us-east-1"
    REGIONS "eu-west-1", "ap-southeast-2"
    SURVIVE REGION FAILURE;

-- Switch to our new database
USE global_app;
  • PRIMARY REGION: Sets the default region. Non-partitioned tables and system data will have their leaseholders prefer this region.
  • REGIONS: Adds other participating regions.
  • SURVIVE REGION FAILURE: This is critical. It tells CockroachDB to configure replication so the database can survive the failure of an entire region.
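
A quick sanity check after creating the database is to ask it which regions it knows about and which one is primary:

sql
-- Verify the database's region configuration.
SHOW REGIONS FROM DATABASE global_app;
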
Step 2: Designing the Partitioned Table

    The key to this pattern is including a column in your table that explicitly states the data's home region. We'll use an ENUM for type safety.

sql
-- Create a type for our regions to enforce consistency.
CREATE TYPE region_enum AS ENUM ('us-east-1', 'eu-west-1', 'ap-southeast-2');

-- Create the users table with the critical 'region' column.
-- Partitioning requires the partition column(s) to be a prefix of the
-- primary key, so 'region' leads the composite key. The UNIQUE constraint
-- on email already creates a secondary index for lookups by email.
CREATE TABLE users (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    region region_enum NOT NULL,
    email STRING NOT NULL UNIQUE,
    profile JSONB,
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (region, id)
);

    Every time a user is created, the application logic must determine their home region (e.g., via GeoIP lookup during signup, or user selection) and populate the region column. This column is the pivot for our entire strategy.

    Step 3: Creating the Partitions

    Now we alter the table to create a list partition based on the region column. Each value in our region_enum will correspond to a separate partition.

    sql
    -- This DDL command splits the table's data into logical partitions.
    -- At this point, the data is still replicated globally.
    ALTER TABLE users PARTITION BY LIST (region) (
        PARTITION us_east_1 VALUES IN ('us-east-1'),
        PARTITION eu_west_1 VALUES IN ('eu-west-1'),
        PARTITION ap_southeast_2 VALUES IN ('ap-southeast-2')
    );

After this step, CockroachDB knows that rows with region = 'us-east-1' belong to the us_east_1 partition, but it hasn't yet changed where the data or leaseholders for that partition live. The data for all partitions is still replicated across all three regions to satisfy the SURVIVE REGION FAILURE setting.
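
You can confirm the partitions (and, once Step 4 is applied, the zone configurations attached to them) with SHOW PARTITIONS:

sql
-- List the partitions of the users table and any zone configs applied to them.
SHOW PARTITIONS FROM TABLE users;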

    Step 4: Pinning Leaseholders with Zone Configurations

    This is the most critical step. We will apply a zone configuration to each partition, constraining its replicas and, most importantly, its leaseholder, to the desired region.

sql
-- For US users, pin their data and leaseholders to us-east-1.
ALTER PARTITION us_east_1 OF TABLE users
    CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=us-east-1]',
    lease_preferences = '[[+region=us-east-1]]';

-- For European users, pin their data and leaseholders to eu-west-1.
ALTER PARTITION eu_west_1 OF TABLE users
    CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=eu-west-1]',
    lease_preferences = '[[+region=eu-west-1]]';

-- For Asia-Pacific users, pin their data and leaseholders to ap-southeast-2.
ALTER PARTITION ap_southeast_2 OF TABLE users
    CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=ap-southeast-2]',
    lease_preferences = '[[+region=ap-southeast-2]]';

    Let's break down this command:

  • ALTER PARTITION ... CONFIGURE ZONE: We are applying a rule to a specific partition.
  • num_replicas = 3: We maintain 3 replicas for HA within the region.
  • constraints = '[+region=...]': This is a hard requirement. It forces all 3 replicas for this partition's data to live on nodes in the specified region, localizing the data entirely.
  • lease_preferences = '[[+region=...]]': This is an explicit instruction to the leaseholder rebalancing system. It tells CockroachDB to strongly prefer placing the leaseholder for this partition on a node in the specified region.

With this configuration, the data for Australian users lives exclusively on nodes in ap-southeast-2, and the Raft leaseholder for that data is kept there as long as a healthy node is available in the region. The latency tax is eliminated.
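
To confirm what was applied, you can read back the zone configuration for a partition (on some CockroachDB versions the keyword is FOR rather than FROM):

sql
-- Inspect the zone configuration attached to a single partition.
SHOW ZONE CONFIGURATION FROM PARTITION ap_southeast_2 OF TABLE users;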

    Verification and Performance Comparison

    Let's prove the difference. Assume we have an application instance running in Sydney (ap-southeast-2).

    Before Geo-Partitioning (Leaseholder in us-east-1):

    sql
    -- Run from a client connected to a Sydney node
    EXPLAIN ANALYZE SELECT * FROM users WHERE id = 'sydney-user-id';
    text
                                                info
    ------------------------------------------------------------------------------------------------
      planning time: 234µs
      execution time: 215ms
      distribution: full
      vectorized: false
      rows read from KV: 1 (84 B)
      cumulative time in KV: 214ms
      KV round-trips: 1
      ... (output truncated)

    The key metric is execution time: 215ms. This is the speed of light tax in action.

    After Geo-Partitioning:

    First, insert a user homed in Sydney:

sql
INSERT INTO users (region, email, profile)
VALUES ('ap-southeast-2', '[email protected]', '{}')
RETURNING id;
-- Let's say this returns id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d'

    Now, run the query from the same Sydney client:

    sql
    EXPLAIN ANALYZE SELECT * FROM users WHERE id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';
    text
                                                info
    ------------------------------------------------------------------------------------------------
      planning time: 180µs
      execution time: 3ms
      distribution: local
      vectorized: false
      rows read from KV: 1 (84 B)
      cumulative time in KV: 2ms
      KV round-trips: 1
      ... (output truncated)

    The difference is staggering. execution time: 3ms. The distribution: local flag confirms that the gateway node was able to satisfy the query without making a network request to another node. We have achieved our goal of low-latency local reads.


    Advanced Scenarios and Production Edge Cases

    Geo-partitioning is not a silver bullet. It introduces significant complexity and trade-offs that senior engineers must architect around.

    Edge Case 1: The Write Latency Penalty

    We optimized for reads, but what about writes? When we constrained each partition's replicas to a single region, we broke our database's region-level survival goal for this table. A write for a Sydney user (INSERT, UPDATE, DELETE) now only needs to achieve Raft quorum among the nodes in ap-southeast-2. This makes writes fast locally but eliminates regional failure tolerance for that data.

    To restore regional fault tolerance, we must modify our zone configs to store replicas in other regions.

sql
-- Let's reconfigure the ap_southeast_2 partition for regional HA.
ALTER PARTITION ap_southeast_2 OF TABLE users
    CONFIGURE ZONE USING
    num_replicas = 3,
    -- Place one replica in each region
    constraints = '{"+region=ap-southeast-2": 1, "+region=us-east-1": 1, "+region=eu-west-1": 1}',
    -- But strongly prefer that the leaseholder stays in APAC
    lease_preferences = '[[+region=ap-southeast-2]]';

-- Repeat for the other partitions...

    Now we have the best of both worlds: data is globally replicated for HA, and reads are served locally and quickly. But we have re-introduced a massive write penalty.

    A write for a Sydney user must now follow this path:

  • Request hits the ap-southeast-2 leaseholder.
  • The leaseholder sends Raft log entries to the follower replicas in us-east-1 and eu-west-1.
    • It must wait for at least one of those followers to acknowledge the entry to achieve quorum (2 out of 3 replicas).
    • This process will take at least the RTT to the fastest of the other two regions.

    An INSERT that previously took 10ms might now take 150-250ms. This is the fundamental trade-off of this pattern.

    Solutions:

  • Accept It: For many use cases (updating a user profile, posting a comment), 250ms write latency is perfectly acceptable.
  • Asynchronous Writes: The application can write to a message queue (Kafka, SQS) and have a background worker perform the database write. The UI can then update optimistically.
  • CQRS (Command Query Responsibility Segregation): A more complex architectural pattern where the write model is separated from the read model. Writes go to a different service/datastore optimized for ingestion, and the data is eventually propagated to the geo-partitioned read model.
Edge Case 2: Cross-Region Queries

    What happens when a user in Europe needs to view the profile of a user in Australia?

    sql
    -- Run from a client connected to an eu-west-1 node
    SELECT * FROM users WHERE email = '[email protected]';

    CockroachDB's distributed SQL optimizer will handle this, but it will be slow. The eu-west-1 gateway node will:

  • Determine the query needs data from the ap_southeast_2 partition.
  • Identify the leaseholder for that partition in ap-southeast-2.
    • Make a remote procedure call (RPC) to that node to fetch the data.

    The query will incur the full eu-west-1 <-> ap-southeast-2 RTT, resulting in ~300ms latency.
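
Before accepting that latency on a hot path, it is worth confirming how the statement is planned. A plain EXPLAIN shows which index the lookup uses and which spans it touches; EXPLAIN ANALYZE additionally reports execution time and KV round-trips, as seen earlier:

sql
-- Inspect the plan for the cross-region lookup by email.
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';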

    Solutions:

  • Service-to-Service Calls: An application service in eu-west-1 can make an API call to a service instance in ap-southeast-2. This keeps the database query local to the data and moves the latency to the application network, which can be managed with timeouts, retries, and circuit breakers.
  • Replicated Data Caches: Use a caching layer like Redis that has a presence in each region, caching profiles for a short TTL.
  • Duplicate Global Tables: For data that is frequently accessed across regions but rarely updated (e.g., user's name and avatar URL), you can maintain a second, non-partitioned global table that is replicated everywhere. This table would have follower reads enabled to provide fast, slightly stale reads for this non-critical data.

sql
CREATE TABLE user_global_directory (
    id UUID PRIMARY KEY,
    region region_enum NOT NULL,
    display_name STRING,
    avatar_url STRING
);

-- This table is NOT partitioned: one replica lives in each region, so
-- writes pay cross-region quorum, but every region has a local copy.
ALTER TABLE user_global_directory CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '{"+region=ap-southeast-2": 1, "+region=us-east-1": 1, "+region=eu-west-1": 1}';

-- Reads that can tolerate slight staleness opt into follower reads,
-- letting the nearest replica serve the query with no cross-region hop.
SELECT display_name, avatar_url
FROM user_global_directory
AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';

    Edge Case 3: Moving a User's Data

    A user moves from London to Sydney. Their home region needs to change to ensure they continue to get low-latency reads. This requires an UPDATE statement.

    sql
    UPDATE users SET region = 'ap-southeast-2' WHERE id = 'london-user-id';

    This simple query triggers a complex background process in CockroachDB:

  • The row is updated, changing the value of the partitioning column (which is part of the primary key).
  • Because the key changes, CockroachDB executes the statement as a transactional delete of the old row from the eu_west_1 partition's ranges and an insert of the new row into the ap_southeast_2 partition's ranges.
  • This is a distributed transaction whose writes span both regions, so it pays cross-region commit latency and, at scale, meaningful I/O and network overhead on both sides.

    This process is atomic and consistent, but it's not instantaneous. You can monitor the progress of data rebalancing through the CockroachDB DB Console's data distribution maps. For applications with high user mobility, you must account for the I/O and network overhead of these partition moves.
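
One way to confirm where a specific row now lives is to ask which range holds its key. With the composite primary key used here, SHOW RANGE ... FOR ROW takes the region and id values in primary-key order; the UUID below is the illustrative one from earlier:

sql
-- Find the range (and therefore the partition and locality) holding one row.
SHOW RANGE FROM TABLE users
FOR ROW ('ap-southeast-2', 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d');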

    Edge Case 4: JOINs with Global Tables

Consider a scenario where you want to join your geo-partitioned users table, via a purchases table, with a non-partitioned, globally replicated products table.

    sql
    -- products table is not partitioned, its leaseholders are likely in the primary region (us-east-1)
    SELECT u.email, p.product_name
    FROM users u
    JOIN purchases pur ON u.id = pur.user_id
    JOIN products p ON pur.product_id = p.id
    WHERE u.id = 'sydney-user-id';

    CockroachDB's cost-based optimizer is sophisticated, but this query can be a performance minefield. The optimizer has to choose a plan:

  • Lookup Join: It can fetch the user row from the local ap-southeast-2 partition, then for each associated purchase, send a request to wherever the products data lives (likely us-east-1). This is a high-latency N+1 query at the database level.
  • Hash Join: It might try to pull all relevant products data to the ap-southeast-2 node and perform the join locally. This can be slow if the products table is large.
  • Locality Optimized Joins: In newer versions of CockroachDB, the optimizer is aware of data locality and will try to schedule join processing on the nodes that hold the most data, but it's not a magic fix.
Solutions:

  • Analyze EXPLAIN Plans: Always, always run EXPLAIN on these types of queries. Understand the join strategy the optimizer is choosing.
  • Application-Side Joins: Often, the most performant solution is to break the query into two separate statements in your application: first fetch the user and their purchases locally, then make a second batch query for all the necessary product details. This gives you explicit control over the network hops (see the sketch below).
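
A minimal sketch of that two-step approach in SQL, assuming a purchases table keyed by the user's region and id (the purchases schema is not defined in this article, so treat its column names as hypothetical):

sql
-- Step 1: a purely local read against the user's home partition.
SELECT pur.product_id
FROM purchases pur
WHERE pur.user_region = 'ap-southeast-2'
  AND pur.user_id = 'e8a3f8b2-4d7e-4b1f-8c3d-9e6a7f8b9c0d';

-- Step 2: one batched lookup for all product details, paying the
-- cross-region hop once instead of once per purchase row.
SELECT id, product_name
FROM products
WHERE id IN ('product-uuid-1', 'product-uuid-2', 'product-uuid-3');
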
Conclusion: A Powerful Tool for a Specific Problem

    Geo-partitioning in CockroachDB is an expert-level feature that directly addresses the challenge of providing low-latency reads in a consistent, globally distributed system. By co-locating data partitions and their Raft leaseholders with your users, you can eliminate cross-continent network hops for your most critical read paths, reducing latency from hundreds of milliseconds to single digits.

    However, this power comes with significant architectural trade-offs:

  • The Write Latency Tax: Achieving regional fault tolerance for your partitioned data means every write must pay the price of cross-region Raft consensus.
  • Cross-Region Complexity: Queries that span partitions are inherently slow, requiring careful architectural consideration at the application or service level.
  • Operational Overhead: Managing data movement for users who change regions and analyzing complex distributed JOINs adds to the system's operational burden.
This pattern is not for every table. It is best suited for read-heavy data that is strongly affiliated with a specific geography, such as user profiles, regional settings, or content tied to a specific country. For truly global, write-heavy data, other patterns may be more appropriate. By understanding the deep implementation details and carefully navigating the edge cases, you can wield geo-partitioning to build truly high-performance, resilient, and globally scalable applications.
