Vector DB Sharding for Billion-Scale Cosine Similarity Search

Goh Ling Yong

The Inevitable Wall: When Vertical Scaling for Vector Search Fails

In the lifecycle of any successful AI-powered application, the vector database eventually hits a scaling wall. Initially, a single, powerful node running a vector database like Milvus, Weaviate, or a PostgreSQL instance with pgvector seems sufficient. But as your dataset of embeddings grows from millions to hundreds of millions, and then pushes towards the billion-vector mark, the single-node architecture begins to fracture under pressure.

The primary culprit is not CPU, but RAM. State-of-the-art Approximate Nearest Neighbor (ANN) indexes, particularly graph-based algorithms like HNSW (Hierarchical Navigable Small World), demand that the entire index and, often, the vectors themselves reside in memory for low-latency queries. A billion 768-dimensional float32 vectors alone consume 1,000,000,000 × 768 × 4 bytes ≈ 3.072 TB. The HNSW graph structure, query-time working memory, and replication add substantial overhead on top of that, pushing you far beyond the capacity of even the most expensive cloud instances.
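
As a quick sanity check, the arithmetic can be written out directly. This is a rough sketch: the graph-link line assumes the commonly cited d × 4 + M × 2 × 4 bytes-per-vector footprint for an HNSW index, with M = 16 as an illustrative parameter choice rather than a figure from any specific engine.

python
# Back-of-the-envelope memory estimate for a billion 768-dim float32 vectors.
NUM_VECTORS = 1_000_000_000
DIM = 768
BYTES_PER_FLOAT32 = 4
HNSW_M = 16  # assumed HNSW connectivity parameter (illustrative)

raw_vectors_bytes = NUM_VECTORS * DIM * BYTES_PER_FLOAT32          # the vectors themselves
graph_links_bytes = NUM_VECTORS * HNSW_M * 2 * BYTES_PER_FLOAT32   # rough layer-0 link overhead

print(f"Raw vectors: {raw_vectors_bytes / 1e12:.3f} TB")   # Raw vectors: 3.072 TB
print(f"HNSW links:  {graph_links_bytes / 1e9:.0f} GB")    # HNSW links:  128 GB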

Vertical scaling—throwing more RAM and CPU at the problem—becomes economically unviable and technically capped. The only path forward is horizontal scaling: sharding. However, sharding a vector database is fundamentally different and more complex than sharding a traditional relational database. In a relational world, queries are typically directed to a specific shard via a primary key. In the vector world, a similarity search query (find the k-nearest neighbors to vector V) has no inherent shard key. The nearest neighbors could exist on any shard, leading to the central challenge we will dissect: the query fan-out problem.

This article bypasses introductory concepts and dives straight into the architectural trade-offs and implementation patterns for sharding vector workloads at scale. We will analyze three core strategies, their performance characteristics, and the complex edge cases they introduce.


Strategy 1: Algorithmic Sharding (The Naive Approach)

Algorithmic sharding is the most straightforward method. It hashes each vector's unique ID and maps the hash, modulo the number of shards, to a target shard. This ensures a statistically even distribution of data, making it a common first step in scaling write operations. (A consistent-hashing ring is the usual refinement when shards are added or removed, since plain modulo forces a large reshuffle on any topology change.)

Implementation: Client-Side Routing Logic

A client-side router is responsible for directing writes and reads. For writes, the logic is simple: hash the vector's ID and send it to the corresponding shard.

python
import hashlib
import uuid

# Assume we have a list of shard connections (e.g., gRPC clients for Milvus)
SHARD_ENDPOINTS = ["milvus-shard-0:19530", "milvus-shard-1:19530", "milvus-shard-2:19530", "milvus-shard-3:19530"]
NUM_SHARDS = len(SHARD_ENDPOINTS)

class AlgorithmicShardingClient:
    def __init__(self, connections):
        self.shards = connections # Simplified: In reality, these would be connection objects

    def _get_shard_index(self, vector_id: str) -> int:
        """Determines the shard index using a consistent hash of the vector ID."""
        # Using SHA-256 for a good distribution
        hasher = hashlib.sha256(vector_id.encode('utf-8'))
        # Use modulo to map the hash to a shard index
        return int(hasher.hexdigest(), 16) % NUM_SHARDS

    def insert(self, vector: list[float], vector_id: str = None):
        if vector_id is None:
            vector_id = str(uuid.uuid4())
        
        shard_index = self._get_shard_index(vector_id)
        target_shard = self.shards[shard_index]
        
        print(f"[WRITE] Routing vector_id '{vector_id}' to shard {shard_index} ({target_shard})")
        # In a real implementation:
        # target_shard_connection = self.get_connection(target_shard)
        # target_shard_connection.insert(collection_name="products", data=[{"id": vector_id, "vector": vector}])
        return vector_id

# --- Usage ---
# client = AlgorithmicShardingClient(SHARD_ENDPOINTS)
# client.insert(vector=[0.1, ..., 0.9], vector_id="product-abc-123")
# client.insert(vector=[0.2, ..., 0.8], vector_id="product-def-456")

The Catastrophic Flaw: Read Amplification and Query Fan-Out

The simplicity of write operations masks a devastating performance problem for reads. Since the nearest neighbors to a query vector could be on any shard, the client has no choice but to send the search request to every single shard simultaneously. This is the query fan-out problem.

python
import concurrent.futures

class AlgorithmicShardingClient:
    # ... (init and _get_shard_index from before)

    def _search_shard(self, shard_index: int, query_vector: list[float], k: int) -> list[dict]:
        """Simulates sending a search request to a single shard."""
        shard_endpoint = self.shards[shard_index]
        print(f"[READ] Querying shard {shard_index} ({shard_endpoint}) for top {k} neighbors...")
        # In a real implementation:
        # shard_conn = self.get_connection(shard_endpoint)
        # results = shard_conn.search(collection_name="products", data=[query_vector], limit=k)
        # return results[0] # Assuming results for one query vector
        # For simulation, we return dummy data:
        return [{'id': f'shard{shard_index}-doc{i}', 'distance': 0.1 * (i + 1) + shard_index} for i in range(k)]

    def search(self, query_vector: list[float], k: int):
        """Performs a search by fanning out to all shards and merging results."""
        print(f"\n--- Starting Global Search for Top {k} ---")
        shard_results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_SHARDS) as executor:
            future_to_shard = {executor.submit(self._search_shard, i, query_vector, k): i for i in range(NUM_SHARDS)}
            for future in concurrent.futures.as_completed(future_to_shard):
                try:
                    shard_results.extend(future.result())
                except Exception as exc:
                    shard_index = future_to_shard[future]
                    print(f"Shard {shard_index} generated an exception: {exc}")

        # The critical, expensive merge step
        print(f"\n[MERGE] Merging {len(shard_results)} results from {NUM_SHARDS} shards...")
        # Sort all results by distance (assuming smaller is better, e.g., Cosine Distance)
        sorted_results = sorted(shard_results, key=lambda x: x['distance'])
        
        # Return the true global top k
        return sorted_results[:k]

# --- Usage ---
# client = AlgorithmicShardingClient(SHARD_ENDPOINTS)
# top_k_results = client.search(query_vector=[0.5, ..., 0.5], k=10)
# print(f"\n--- Final Top 10 Results ---\n{top_k_results}")

Performance Analysis:

* Network Overhead: The query router becomes a firehose, sending the same query payload to N shards. This saturates network bandwidth.

* CPU Cost: Each of the N shards independently performs a costly ANN search. This is a massive duplication of effort.

* Tail Latency: The overall query latency is determined by the slowest shard. If one shard is experiencing high load or a network hiccup, the entire request is delayed. This makes P99 latency unpredictable and difficult to control.

* Merge Cost: The router receives N × k results and must reduce them to the true top k. While fast for small N and k, this aggregation step adds latency and becomes a CPU bottleneck on the client/router itself as the number of shards grows (a heap-based merge is sketched below).
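
Since each shard already returns its hits sorted by ascending distance (as in the simulated _search_shard above), the router can use a k-way heap merge instead of sorting all N × k hits. A minimal sketch; the helper name is illustrative:

python
import heapq
import itertools

def merge_top_k(per_shard_hits: list[list[dict]], k: int) -> list[dict]:
    """Merge per-shard hit lists into the global top k.

    Assumes each inner list is already sorted by ascending distance, so a
    k-way heap merge avoids materialising and sorting all N * k results.
    """
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit['distance'])
    return list(itertools.islice(merged, k))

# e.g. merge_top_k([hits_from_shard_0, hits_from_shard_1, ...], k=10)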

Verdict: Algorithmic sharding is suitable only for write-heavy workloads or systems with a very small number of shards (e.g., 2-4). For read-heavy, low-latency applications, it does not scale.


Strategy 2: Metadata-Driven Sharding (The Targeted Approach)

This strategy leverages an important observation: most real-world vector search queries are not pure vector similarity searches. They are hybrid queries that include a metadata filter, such as tenant_id, product_category, user_region, or is_public.

Metadata-driven sharding uses the value of a specific metadata field as the shard key. This allows the query router to bypass fan-out entirely and target a single shard (or a small subset of shards) that is known to contain the relevant data.

Implementation: A Multi-Tenant E-Commerce Scenario

Consider an e-commerce platform with millions of products. We can shard the product image vectors by category_id. Each category maps to a specific shard.

python
# This mapping could be stored in a config file, a database, or a service like Zookeeper/etcd
CATEGORY_TO_SHARD_MAP = {
    "electronics": 0,
    "apparel": 1,
    "home_goods": 2,
    "books": 3,
    # ... other categories
}
DEFAULT_SHARD = 0 # Fallback for unknown categories

class MetadataShardingClient:
    def __init__(self, connections, mapping):
        self.shards = connections
        self.mapping = mapping

    def _get_shard_index_from_metadata(self, metadata: dict) -> int:
        category = metadata.get("category_id")
        if not category:
            raise ValueError("Missing 'category_id' in metadata for sharding.")
        return self.mapping.get(category, DEFAULT_SHARD)

    def insert(self, vector: list[float], metadata: dict, vector_id: str = None):
        if vector_id is None:
            vector_id = str(uuid.uuid4())
        
        shard_index = self._get_shard_index_from_metadata(metadata)
        target_shard = self.shards[shard_index]
        print(f"[WRITE] Routing vector_id '{vector_id}' (category: {metadata['category_id']}) to shard {shard_index}")
        # ... actual insert logic ...

    def search(self, query_vector: list[float], k: int, filters: dict):
        """Performs a targeted search if the shard key is in the filters."""
        print(f"\n--- Starting Metadata Search for Top {k} with filters: {filters} ---")
        shard_key_value = filters.get("category_id")

        if shard_key_value:
            # This is the ideal case: we can target a single shard.
            shard_index = self.mapping.get(shard_key_value, DEFAULT_SHARD)
            print(f"[READ] Shard key '{shard_key_value}' found. Targeting shard {shard_index}.")
            # Delegate to a single shard search method
            return self._search_shard(shard_index, query_vector, k)
        else:
            # This is the problematic case: no shard key provided.
            print("[READ] No shard key found in filters. Falling back to full fan-out!")
            # We must revert to the inefficient algorithmic sharding read pattern.
            # (Implementation omitted for brevity, but it's the same as the previous example)
            raise NotImplementedError("Full fan-out search required.")

    def _search_shard(self, shard_index: int, query_vector: list[float], k: int) -> list[dict]:
        # Same as before, but now it's called selectively.
        shard_endpoint = self.shards[shard_index]
        print(f"Querying single shard {shard_index} ({shard_endpoint}) for top {k} neighbors...")
        return [{'id': f'shard{shard_index}-doc{i}', 'distance': 0.1 * (i + 1)} for i in range(k)]

# --- Usage ---
# client = MetadataShardingClient(SHARD_ENDPOINTS, CATEGORY_TO_SHARD_MAP)
# client.insert([0.1]*768, {"category_id": "electronics"}, "prod-123")
# client.insert([0.2]*768, {"category_id": "apparel"}, "prod-456")

# Efficient, targeted query
# results = client.search([0.11]*768, k=10, filters={"category_id": "electronics"})
# print(f"\n--- Final Results ---\n{results}")

Performance Analysis:

* Read Latency: For queries that include the shard key, latency is drastically reduced. It's equivalent to querying a single, smaller vector database. P99 latency is stable and predictable.

* Throughput: Since each query only hits one shard, the total system throughput is roughly N times that of a single node (where N is the number of shards), as shards operate independently.

* Isolation: Tenants or categories are isolated from each other. A search spike in electronics will not impact the performance of searches in apparel.

Edge Cases and Production Challenges

While highly effective, this pattern introduces new, complex problems:

  • The Hot Shard Problem: What if the electronics category has 500 million products, while books only has 1 million? The electronics shard will become a massive performance bottleneck, overloaded with data and queries, while the books shard sits idle. The data distribution is skewed, defeating the purpose of sharding.
  • Cross-Shard Queries: What happens when a user performs a search without filtering by category? The router has no choice but to fall back to the inefficient fan-out strategy, querying all shards and destroying performance.
  • Shard Key Immutability: Once a vector is assigned to a shard based on its category, changing that category becomes a cross-shard move: insert into the new shard, then delete from the old one (see the sketch after this list). Done naively, this can lead to consistency issues.
  • Shard Management: As new categories are added, the CATEGORY_TO_SHARD_MAP must be updated. Rebalancing a hot shard (e.g., splitting the electronics shard into electronics-phones and electronics-laptops) requires a complex and carefully managed data migration process.
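
A shard-key change is essentially a copy-then-delete move between shards. A minimal sketch of the idea, reusing the MetadataShardingClient defined above; the delete call is a hypothetical API and is left commented out:

python
def change_category(client: MetadataShardingClient, vector_id: str,
                    vector: list[float], old_meta: dict, new_meta: dict):
    """Re-home a vector whose shard key (category_id) has changed.

    Write to the new shard first, then delete from the old one, so a crash
    in between leaves a duplicate rather than a lost vector. A production
    system would drive this from an idempotent job queue.
    """
    old_shard = client._get_shard_index_from_metadata(old_meta)
    new_shard = client._get_shard_index_from_metadata(new_meta)
    if old_shard == new_shard:
        return  # same physical shard: a normal in-place metadata update suffices

    # 1. Insert the vector with its updated metadata into the new shard.
    client.insert(vector, new_meta, vector_id)
    # 2. Only then remove it from the old shard (hypothetical delete API).
    # client.delete(shard_index=old_shard, vector_id=vector_id)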

Verdict: Metadata-driven sharding is a powerful, high-performance pattern for applications with predictable, filtered query patterns. However, it is vulnerable to data skew (hot shards) and performs poorly for queries that lack the shard key.


Strategy 3: Hybrid Two-Tier Sharding (The Production-Grade Pattern)

To build a truly resilient, scalable system, we must combine the strengths of the previous two approaches. A hybrid, two-tier architecture uses metadata-driven sharding as the primary layer to route queries to a group of shards, and then uses algorithmic sharding within that group to ensure even data distribution and prevent hot spots.

This architecture effectively solves the hot shard problem. If the electronics category is 10 times larger than apparel, you can simply assign it a shard group with 10 physical nodes, while apparel gets a group with one node. The system remains balanced.

Architecture Overview

text
                              +-------------------------+
                              |      Query Router       |
                              +-------------------------+
                                           |
                                           | (Inspects metadata filter, e.g., 'category_id')
             +-----------------------------+-----------------------------+
             |                             |                             |
+-------------------------+   +-------------------------+   +-------------------------+
|      Shard Group A      |   |      Shard Group B      |   |      Shard Group C      |
| (Category: electronics) |   |   (Category: apparel)   |   |    (Category: books)    |
|    (4 physical nodes)   |   |    (1 physical node)    |   |    (1 physical node)    |
+-------------------------+   +-------------------------+   +-------------------------+
             |                             |
             | (Hash of vector_id)         | (No hash needed, only 1 node)
             v                             v
 [Node A0] [Node A1] [Node A2] [Node A3]  [Node B0]                  [Node C0]

Implementation: Two-Tier Routing Logic

The client-side router now becomes significantly more complex. It needs a multi-level mapping configuration and must handle both targeted and fan-out queries intelligently.

python
# Advanced configuration defining shard groups
SHARD_GROUP_CONFIG = {
    "groups": {
        "electronics": ["shard-e-0", "shard-e-1", "shard-e-2", "shard-e-3"], # Hot shard gets 4 nodes
        "apparel":     ["shard-a-0"],
        "home_goods":  ["shard-h-0", "shard-h-1"], # Medium shard gets 2 nodes
        "books":       ["shard-b-0"]
    },
    "default_group": "electronics" # Where to route data with unknown categories
}

class HybridShardingClient:
    def __init__(self, config):
        self.config = config
        # In a real system, this would manage connection pools to all physical shards
        self.all_shards = [shard for group in config['groups'].values() for shard in group]

    def _get_shard_group(self, metadata: dict) -> list[str]:
        category = metadata.get("category_id")
        return self.config['groups'].get(category, self.config['groups'][self.config['default_group']])

    def _get_target_shard_in_group(self, group: list[str], vector_id: str) -> str:
        """Algorithmic sharding *within* a group."""
        if not group:
            raise ValueError("Cannot determine target shard for an empty group.")
        num_sub_shards = len(group)
        hasher = hashlib.sha256(vector_id.encode('utf-8'))
        sub_shard_index = int(hasher.hexdigest(), 16) % num_sub_shards
        return group[sub_shard_index]

    def insert(self, vector: list[float], metadata: dict, vector_id: str = None):
        vector_id = vector_id or str(uuid.uuid4())
        group = self._get_shard_group(metadata)
        target_shard = self._get_target_shard_in_group(group, vector_id)
        print(f"[WRITE] Vector '{vector_id}' (Category: {metadata['category_id']}) -> Group '{metadata['category_id']}' -> Shard '{target_shard}'")
        # ... insert logic for the specific shard ...

    def search(self, query_vector: list[float], k: int, filters: dict):
        """Smart search: targets a group if possible, otherwise fans out to all groups."""
        category = filters.get("category_id")

        if category:
            # Targeted search: fan-out only within the specific group
            target_group = self.config['groups'].get(category)
            if not target_group:
                print(f"Warning: Category '{category}' not found, falling back to full scan.")
                return self._search_all_shards(query_vector, k)

            print(f"[READ] Targeted search in group '{category}' ({len(target_group)} nodes)")
            return self._search_shard_group(target_group, query_vector, k)
        else:
            # Global search: must fan-out to every single shard in the system
            print(f"[READ] No category filter. Fanning out to all {len(self.all_shards)} shards.")
            return self._search_all_shards(query_vector, k)

    def _search_shard_group(self, group: list[str], query_vector: list[float], k: int):
        # This is essentially the fan-out logic from Strategy 1, but scoped to a smaller set of shards.
        # This parallelism is now manageable and performant.
        # ... (concurrently query all shards in 'group', then merge results) ...
        print(f"Fanning out to {len(group)} shards and merging {len(group) * k} results.")
        # Dummy implementation
        all_results = []
        for shard in group:
            all_results.extend([{'id': f'{shard}-doc{i}', 'distance': 0.1 * (i + 1)} for i in range(k)])
        return sorted(all_results, key=lambda x: x['distance'])[:k]

    def _search_all_shards(self, query_vector: list[float], k: int):
        return self._search_shard_group(self.all_shards, query_vector, k)

# --- Usage ---
# client = HybridShardingClient(SHARD_GROUP_CONFIG)
# A query for a large category fans out to a managed, parallelized group
# results_electronics = client.search([0.1]*768, k=10, filters={"category_id": "electronics"})
# A query for a small category hits just one node
# results_apparel = client.search([0.2]*768, k=10, filters={"category_id": "apparel"})

Performance and Scalability Analysis

This hybrid model provides a robust balance of performance and scalability:

* Eliminates Hot Shards: By allocating more physical nodes to larger data partitions (categories), the load is distributed evenly. The write distribution within the group is handled by the hashing function.

* Controlled Fan-Out: For the most common, filtered queries, fan-out is limited to a small, manageable number of nodes within a group. This keeps latency low and predictable.

* Elastic Scalability: To scale up a hot category, you can add more nodes to its group and trigger a background rebalancing of the data within that group based on the new modulo (the selection logic is sketched after this list). This operation is isolated and does not affect other categories.

* Graceful Degradation: While global, unfiltered queries are still expensive, the architecture is designed to optimize for the 99% of queries that are filtered. The performance of the fallback case is a known and accepted trade-off.
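
To make the rebalancing step concrete, here is a sketch of the selection logic; the names are illustrative and the actual scan-and-stream plumbing is omitted. Only vectors whose modulo assignment changes need to move.

python
import hashlib

def vectors_to_move(vector_ids: list[str], old_group: list[str], new_group: list[str]) -> dict[str, str]:
    """Return {vector_id: destination_shard} for vectors whose hash-modulo
    assignment changes when a group is resized."""
    moves = {}
    for vid in vector_ids:
        h = int(hashlib.sha256(vid.encode('utf-8')).hexdigest(), 16)
        old_shard = old_group[h % len(old_group)]
        new_shard = new_group[h % len(new_group)]
        if old_shard != new_shard:
            moves[vid] = new_shard
    return moves

# e.g. growing the 'home_goods' group from two nodes to three:
# moves = vectors_to_move(ids_in_group,
#                         ["shard-h-0", "shard-h-1"],
#                         ["shard-h-0", "shard-h-1", "shard-h-2"])

Plain modulo rehashing relocates a large share of the group's vectors whenever the group is resized; if resizing is frequent, a consistent-hash ring within the group limits movement to roughly the share taken over by the new node, at the cost of a slightly more complex router.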

Final Considerations: The Unseen Complexities

Implementing a sharded vector database involves more than just routing logic. Senior engineers must also contend with:

  • Result Merging Accuracy: As long as each shard returns its exact top k, merging the N × k candidates and re-ranking recovers the true global top k. The subtlety is that each shard runs an approximate search: a neighbor that a shard's ANN index misses can never reappear at merge time. The standard mitigation is to over-fetch, for example requesting roughly 1.5 × k results from each shard (and/or raising per-shard search parameters such as HNSW's ef), trading a larger merge for higher recall (see the sketch after this list).
  • Replication and High Availability: Each physical shard must be replicated to handle node failures. This adds another layer of complexity to the infrastructure, typically managed by Kubernetes operators or native cloud services.
  • Configuration Management: The shard mapping configuration is a critical piece of infrastructure. It must be stored in a highly available system like etcd or ZooKeeper and updated dynamically as the cluster topology changes.
  • Rebalancing Operations: A production system needs robust, automated tooling to handle rebalancing. This involves adding a node to a group, streaming a portion of the data from existing nodes to the new one, updating the routing configuration, and gracefully draining connections to old configurations—all without downtime.
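
A minimal sketch of the over-fetch pattern for the merging-accuracy point above, reusing the _search_shard helper from the earlier clients; the 1.5 factor is a starting point to tune against measured recall, not a recommendation from any particular engine:

python
import math

def search_with_overfetch(client, shard_indices: list[int],
                          query_vector: list[float], k: int,
                          overfetch: float = 1.5) -> list[dict]:
    """Request ceil(overfetch * k) hits from every shard, then keep the
    global top k. Over-fetching compensates for per-shard ANN recall loss
    at the cost of a slightly larger merge."""
    per_shard_k = math.ceil(overfetch * k)
    hits = []
    for idx in shard_indices:
        hits.extend(client._search_shard(idx, query_vector, per_shard_k))
    return sorted(hits, key=lambda h: h['distance'])[:k]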

Conclusion

Sharding a vector database for billion-scale workloads is a complex system design challenge that forces a departure from traditional database scaling patterns.

  • Algorithmic Sharding is simple to implement but fails catastrophically on read performance due to universal query fan-out.
  • Metadata-Driven Sharding offers surgical precision and excellent performance for filtered queries but is brittle and susceptible to hot shards and poor performance on global queries.
  • Hybrid Two-Tier Sharding provides the most robust and production-ready solution. It contains the blast radius of query fan-out by using metadata to select a shard group, while leveraging algorithmic sharding within the group to ensure even load distribution and prevent hot spots.

Choosing the right strategy requires a deep understanding of your application's query patterns and data distribution. For any system expected to operate at a significant scale, the investment in a hybrid architecture is not just an optimization—it is a prerequisite for building a stable, performant, and scalable vector search platform.
