Scaling RAG: Vector DB Sharding for Sub-100ms LLM-Powered Search

21 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inevitable Monolith: When Single-Node Vector Search Fails

In production Retrieval-Augmented Generation (RAG) systems, the initial proof-of-concept, often built on a single FAISS index or a default vector database instance, performs admirably with a few million vectors. Latencies are low, relevance is high, and the system feels magical. This honeymoon phase ends abruptly as the document corpus scales from millions to tens of millions, and then to billions of embeddings. The monolithic vector index, once a source of speed, becomes the primary performance bottleneck.

The core issue lies in the nature of Approximate Nearest Neighbor (ANN) search algorithms like HNSW (Hierarchical Navigable Small World). While incredibly efficient, their performance characteristics degrade non-linearly with index size and dimensionality.

Key Failure Modes of a Monolithic Index:

  • Memory Saturation: HNSW graphs, the data structure behind modern ANN search, must reside in RAM for low-latency performance. A billion 768-dimensional float32 vectors require 1,000,000,000 × 768 × 4 bytes ≈ 3.072 TB of RAM for the raw vectors alone, plus significant overhead (often 1.5-2x) for the graph structure itself. This rapidly exceeds the capacity of even the largest cloud instances (a back-of-the-envelope sketch appears after this list).
  • Latency Explosion: As the graph grows, the number of hops required to traverse it and find the nearest neighbors increases. While HNSW's search complexity is logarithmic (O(log N)), the constant factors matter. P99 query latency can balloon from 50ms to over 500ms, rendering real-time applications unusable.
  • Indexing Pauses and Cold Starts: Building or updating a massive HNSW graph is a computationally intensive process. It can lead to significant indexing lag or require lengthy warm-up periods after a restart, impacting data freshness and system availability.
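
To make the memory-saturation point concrete, here is a minimal back-of-the-envelope sketch. The vector count, dimensionality, and graph-overhead factor are illustrative assumptions, not measurements from a specific deployment.

python
# Rough RAM estimate for an in-memory HNSW index (illustrative numbers only).
NUM_VECTORS = 1_000_000_000        # one billion embeddings
DIMENSIONS = 768                   # e.g., a typical sentence-transformer output
BYTES_PER_FLOAT32 = 4
GRAPH_OVERHEAD_FACTOR = 1.75       # assumed 1.5-2x overhead for the HNSW graph links

raw_vector_bytes = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
total_bytes = raw_vector_bytes * GRAPH_OVERHEAD_FACTOR

print(f"Raw vectors: {raw_vector_bytes / 1e12:.2f} TB")   # ~3.07 TB
print(f"With graph:  {total_bytes / 1e12:.2f} TB")        # ~5.38 TB
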
Simply scaling vertically, by throwing more RAM and CPU at a single node, is a losing battle: it's prohibitively expensive and eventually hits a hard physical limit. The only viable path forward is to scale horizontally through sharding.


    Foundational Sharding Strategy: Metadata-Based Partitioning

    Sharding a vector database involves partitioning the entire dataset of vectors into smaller, independent, and more manageable indexes, or shards. While a naive approach might be to randomly distribute vectors across shards, this is catastrophically inefficient for queries, as it forces a search across every single shard (a full scatter-gather) for every lookup.

    A far more effective and production-proven strategy is metadata-based sharding. This approach leverages the inherent structure within your data to intelligently route vectors to specific shards during ingestion. The choice of the shard key is the most critical architectural decision you will make.

    Commonly used shard keys include:

    * tenant_id: In multi-tenant SaaS applications, this is the most natural and effective key. It perfectly isolates customer data, simplifying security, data management, and queries.

    * document_source: Grouping by the origin of the data (e.g., 'confluence', 'jira', 'slack').

    * date_range: Sharding by time (e.g., YYYY-MM) can be effective for time-series or log-based data, allowing older, less-frequently-accessed data to be placed on cheaper storage.

    * region or language: For global applications, partitioning by geography or language can reduce cross-region latency.

    Ingestion Pipeline with Shard Routing Logic

    The logic for sharding is enforced at the point of ingestion. The ingestion pipeline becomes responsible for inspecting document metadata and directing the resulting vectors to the appropriate database shard client.

    Here is a Python implementation of an ingestion service that routes documents to different shards based on a tenant_id.

    python
    import hashlib
    from typing import Dict, Any, List
    
    # Assume we have client connections to our vector database shards.
    # In a real system, this would be managed by a connection pool or service discovery.
    # For this example, we'll simulate it with a dictionary of clients.
    
    class VectorDBClientMock:
        def __init__(self, shard_id: str):
            self.shard_id = shard_id
            print(f"Initialized client for shard: {self.shard_id}")
    
        def insert_vectors(self, vectors: List[Dict[str, Any]]):
            # In a real implementation, this would be a network call to the specific shard.
            print(f"[{self.shard_id}] Inserting {len(vectors)} vectors.")
            # Example vector structure: {'id': '...', 'vector': [...], 'metadata': {...}}
            pass
    
    # --- Configuration ---
    NUM_SHARDS = 4
    SHARD_PREFIX = "vector-shard-"
    
    # --- Shard Client Management ---
    SHARD_CLIENTS = {
        f"{SHARD_PREFIX}{i}": VectorDBClientMock(f"{SHARD_PREFIX}{i}") 
        for i in range(NUM_SHARDS)
    }
    
    def get_shard_id_for_tenant(tenant_id: str) -> str:
        """Determines the shard ID for a given tenant using consistent hashing."""
        # Using a simple modulo hash for demonstration.
        # In production, consider more robust consistent hashing algorithms like Rendezvous Hashing
        # or using a library like `uhashring` to handle node additions/removals gracefully.
        hasher = hashlib.sha256(tenant_id.encode('utf-8'))
        hash_val = int(hasher.hexdigest(), 16)
        shard_index = hash_val % NUM_SHARDS
        return f"{SHARD_PREFIX}{shard_index}"
    
    class IngestionService:
        def __init__(self, shard_clients: Dict[str, VectorDBClientMock]):
            self.shard_clients = shard_clients
    
        def process_document(self, document: Dict[str, Any]):
            """Processes a single document, generates embeddings, and routes to the correct shard."""
            metadata = document.get("metadata", {})
            tenant_id = metadata.get("tenant_id")
    
            if not tenant_id:
                print("Skipping document - missing 'tenant_id' in metadata.")
                return
    
            # 1. Determine the target shard
            shard_id = get_shard_id_for_tenant(tenant_id)
            target_client = self.shard_clients.get(shard_id)
    
            if not target_client:
                print(f"Error: No client found for shard_id: {shard_id}")
                return
    
            # 2. Generate embeddings (mocked here)
            # In a real system, this calls an embedding model like SBERT, OpenAI, etc.
            vectors_to_insert = self._generate_embeddings(document)
    
            # 3. Insert into the target shard
            target_client.insert_vectors(vectors_to_insert)
            print(f"Routed document {document['id']} for tenant {tenant_id} to shard {shard_id}")
    
        def _generate_embeddings(self, document: Dict[str, Any]) -> List[Dict[str, Any]]:
            # This is a placeholder for a real embedding generation process.
            # A single document might be chunked into multiple vectors.
            return [
                {
                    "id": f"{document['id']}-0",
                    "vector": [0.1] * 768, # Placeholder 768-dim vector
                    "metadata": document["metadata"]
                }
            ]
    
    # --- Example Usage ---
    if __name__ == "__main__":
        ingestion_service = IngestionService(SHARD_CLIENTS)
    
        doc1 = {"id": "doc-abc-123", "content": "...", "metadata": {"tenant_id": "tenant-A"}}
        doc2 = {"id": "doc-def-456", "content": "...", "metadata": {"tenant_id": "tenant-B"}}
        doc3 = {"id": "doc-ghi-789", "content": "...", "metadata": {"tenant_id": "tenant-C"}}
        doc4 = {"id": "doc-jkl-012", "content": "...", "metadata": {"tenant_id": "tenant-A"}} # Another doc for tenant-A
    
        ingestion_service.process_document(doc1)
        ingestion_service.process_document(doc2)
        ingestion_service.process_document(doc3)
        ingestion_service.process_document(doc4)

    This code establishes the fundamental pattern: the application layer is responsible for routing, not the database. This gives us maximum flexibility but requires a new component: a query router.


    The Brains of the Operation: The Metadata-Aware Query Router

    With data neatly partitioned across shards, the query path must be equally intelligent. A Query Router service becomes the single entry point for all search requests. Its primary responsibility is to inspect incoming queries, identify the target shard(s) based on metadata filters, and dispatch the query accordingly.

    This architecture transforms a monolithic search problem into a highly parallelizable one. A query for tenant-A will only hit the shard containing tenant-A's data, ignoring all other shards. This is the key to maintaining low latency at scale.

    Below is a FastAPI implementation of a query router. The shard clients are mocked for brevity, but the routing logic mirrors what a production deployment would do.

    python
    from fastapi import FastAPI, HTTPException, Depends
    from pydantic import BaseModel, Field
    from typing import List, Dict, Any
    
    # --- Data Models for API ---
    class QueryRequest(BaseModel):
        query_vector: List[float]
        top_k: int = 10
        filters: Dict[str, Any] = Field(..., description="Metadata filters to apply. MUST contain 'tenant_id'.")
    
    class QueryResult(BaseModel):
        id: str
        score: float
        metadata: Dict[str, Any]
    
    # --- Shard Client Simulation (same as before) ---
    class VectorDBClientMock:
        def __init__(self, shard_id: str):
            self.shard_id = shard_id
    
        def search(self, vector: List[float], top_k: int, filters: Dict[str, Any]) -> List[QueryResult]:
            print(f"[{self.shard_id}] Searching with filters: {filters}")
            # Mocked search results
            return [
                QueryResult(
                    id=f"mock-doc-{i}", 
                    score=1.0 - (i * 0.05), 
                    metadata=filters
                ) for i in range(top_k)
            ]
    
    # --- Configuration and Client Management ---
    NUM_SHARDS = 4
    SHARD_PREFIX = "vector-shard-"
    SHARD_CLIENTS = {
        f"{SHARD_PREFIX}{i}": VectorDBClientMock(f"{SHARD_PREFIX}{i}") 
        for i in range(NUM_SHARDS)
    }
    
    # Dependency to manage clients
    def get_shard_clients():
        return SHARD_CLIENTS
    
    # Hashing function (must be identical to the one in the ingestion service)
    def get_shard_id_for_tenant(tenant_id: str) -> str:
        import hashlib
        hasher = hashlib.sha256(tenant_id.encode('utf-8'))
        hash_val = int(hasher.hexdigest(), 16)
        shard_index = hash_val % NUM_SHARDS
        return f"{SHARD_PREFIX}{shard_index}"
    
    # --- FastAPI Application ---
    app = FastAPI(
        title="Vector DB Query Router",
        description="Routes queries to the appropriate vector database shard based on metadata."
    )
    
    @app.post("/v1/query", response_model=List[QueryResult])
    async def query_shard(
        request: QueryRequest,
        shard_clients: Dict[str, VectorDBClientMock] = Depends(get_shard_clients)
    ):
        """Receives a query, determines the target shard, and executes the search."""
        tenant_id = request.filters.get("tenant_id")
    
        if not tenant_id or not isinstance(tenant_id, str):
            raise HTTPException(
                status_code=400,
                detail="'tenant_id' is a required string field in the filters."
            )
    
        # 1. Route: Determine the target shard
        shard_id = get_shard_id_for_tenant(tenant_id)
        target_client = shard_clients.get(shard_id)
    
        if not target_client:
            raise HTTPException(
                status_code=503, # Service Unavailable
                detail=f"Shard '{shard_id}' is currently unavailable."
            )
    
        # 2. Dispatch: Forward the query to the specific shard
        try:
            results = target_client.search(
                vector=request.query_vector,
                top_k=request.top_k,
                filters=request.filters
            )
            return results
        except Exception as e:
            # In a real system, add proper logging and error handling
            # for network timeouts, shard errors, etc.
            raise HTTPException(
                status_code=500,
                detail=f"An error occurred while querying shard '{shard_id}': {e}"
            )
    
    # To run this app: `uvicorn query_router_app:app --reload`
    # Example curl request:
    # curl -X POST "http://127.0.0.1:8000/v1/query" -H "Content-Type: application/json" -d '{
    #   "query_vector": [0.1, 0.2, 0.3],
    #   "top_k": 5,
    #   "filters": {"tenant_id": "tenant-A", "category": "finance"}
    # }'

    This router is simple but effective. It enforces the critical contract that every query must specify a tenant_id, allowing for precise, single-shard lookups.


    Advanced Pattern: Multi-Shard Queries and Result Aggregation

    The single-shard lookup pattern covers the vast majority of use cases. However, there are legitimate scenarios that require querying across multiple, or even all, shards:

    * An admin-level search for content across all tenants.

    * A query where the shard key is not present (e.g., searching by a document tag that exists across tenants).

    This requires a scatter-gather pattern: the router sends the query to all relevant shards in parallel and then aggregates the results. The aggregation step is non-trivial.

    A naive approach of concatenating results and sorting by score is flawed because similarity scores (like cosine similarity or L2 distance) are not directly comparable across different, independently-built vector indexes. A score of 0.9 in Shard A might represent a better match than a score of 0.92 in Shard B due to variations in data distribution within each shard.

    Reciprocal Rank Fusion (RRF)

    A more robust method for merging result sets from different systems is Reciprocal Rank Fusion (RRF). RRF is a simple yet powerful algorithm that disregards the absolute scores and instead uses the rank of each document in its original result list. It provides a stable and effective way to fuse multiple ranked lists into a single, high-quality result set.

    The RRF score for a document is calculated as:

    RRF_Score(d) = Σ (1 / (k + rank(d, i))) for each list i where document d appears.

    * rank(d, i) is the rank of document d in result list i.

    * k is a constant (commonly set to 60) that dampens the influence of high ranks.
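
    For intuition, with k = 60 a document that appears at rank 1 in two shards' result lists scores 2 × 1/(60 + 1) ≈ 0.033, while a document at rank 1 in only one list scores roughly 0.016, so agreement across shards is rewarded without ever comparing raw similarity scores.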

    Here's how to implement RRF and integrate it into a multi-shard query endpoint.

    python
    import asyncio
    from collections import defaultdict
    
    # ... (Keep all previous FastAPI code and models)
    
    # Let's add a new endpoint to our FastAPI app for multi-shard queries
    
    async def query_single_shard_async(
        client: VectorDBClientMock, 
        vector: List[float], 
        top_k: int, 
        filters: Dict[str, Any]
    ) -> List[QueryResult]:
        # In a real system, the client's search method would be async
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, client.search, vector, top_k, filters
        )
    
    def fuse_results_with_rrf(shard_results: List[List[QueryResult]], k: int = 60) -> List[QueryResult]:
        """Aggregates results from multiple shards using Reciprocal Rank Fusion."""
        fused_scores = defaultdict(float)
        doc_registry = {}
    
        for results in shard_results:
            for rank, doc in enumerate(results):
                if doc.id not in doc_registry:
                    doc_registry[doc.id] = doc
                
                # RRF formula
                fused_scores[doc.id] += 1 / (k + rank + 1) # rank is 0-indexed
    
        # Sort documents by their fused RRF score in descending order
        sorted_doc_ids = sorted(fused_scores.keys(), key=lambda id: fused_scores[id], reverse=True)
    
        # Build the final, ranked list of results
        final_results = []
        for doc_id in sorted_doc_ids:
            doc = doc_registry[doc_id]
            # We can update the score to be the RRF score for transparency
            doc.score = fused_scores[doc_id]
            final_results.append(doc)
    
        return final_results
    
    @app.post("/v1/query/multi-shard", response_model=List[QueryResult])
    async def query_multi_shard(
        request: QueryRequest,
        shard_clients: Dict[str, VectorDBClientMock] = Depends(get_shard_clients)
    ):
        """Performs a scatter-gather query across all shards and fuses results with RRF."""
        # This is an admin-level query, so we ignore the tenant_id routing
        # and query all shards in parallel.
    
        target_shards = list(shard_clients.values())
    
        # 1. Scatter: Dispatch queries to all shards concurrently
        tasks = [
            query_single_shard_async(
                client=shard,
                vector=request.query_vector,
                top_k=request.top_k,
                # We might pass down other filters, but not the tenant_id
                filters={k: v for k, v in request.filters.items() if k != 'tenant_id'}
            ) 
            for shard in target_shards
        ]
        shard_results = await asyncio.gather(*tasks)
    
        # 2. Gather & Fuse: Aggregate results using RRF
        final_ranked_list = fuse_results_with_rrf(shard_results)
    
        # 3. Return the top_k results from the fused list
        return final_ranked_list[:request.top_k]
    

    This pattern provides a powerful escape hatch for queries that don't fit the single-shard model, ensuring the architecture remains flexible.


    Production Headaches: Edge Cases and Performance Tuning

    Implementing a sharded vector search system introduces new complexities that must be managed in a production environment.

  • Shard Hotspotting: If one tenant is significantly larger than others, their dedicated shard can become a bottleneck (a "hotspot"), re-introducing the original monolith problem at a smaller scale.
  • Solution: Implement a two-level hashing strategy. Use the tenant_id to find a set of potential shards, and then use a secondary key (like the document_id) to pick a shard within that set. This allows a large tenant's data to be spread across multiple physical shards while still maintaining logical isolation (a minimal sketch follows).
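
    A minimal sketch of that two-level scheme, assuming a hypothetical TENANT_SHARD_SETS config that lists which shards a known-hot tenant spans; ordinary tenants fall through to the single-shard hash used earlier:

    python
    import hashlib

    NUM_SHARDS = 8
    SHARD_PREFIX = "vector-shard-"

    # Hypothetical config: known-hot tenants are mapped to a set of shards.
    TENANT_SHARD_SETS = {
        "tenant-big-corp": [0, 1, 2, 3],  # this tenant's data spans four shards
    }

    def _stable_hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def get_shard_id(tenant_id: str, document_id: str) -> str:
        shard_set = TENANT_SHARD_SETS.get(tenant_id)
        if shard_set:
            # Level 2: spread the hot tenant's documents across its shard set.
            index = shard_set[_stable_hash(document_id) % len(shard_set)]
        else:
            # Level 1: ordinary tenants stay on a single shard.
            index = _stable_hash(tenant_id) % NUM_SHARDS
        return f"{SHARD_PREFIX}{index}"

    At query time, the router scatters a hot tenant's query across its shard set and fuses the partial results, exactly as in the multi-shard endpoint above.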

  • Rebalancing and Resharding: What happens when you need to add more shards to the cluster? A simple modulo hash (hash % NUM_SHARDS) is brittle; changing NUM_SHARDS remaps most keys and forces a near-total data migration.
  • Solution: Use a consistent hashing algorithm from the start. Libraries like uhashring in Python map keys to a continuous ring, so the addition or removal of a node only requires a fraction of keys to be remapped. This allows for dynamic scaling of the shard cluster with minimal data movement (a sketch of the idea follows).
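
    As a library-free illustration of the same idea, here is a minimal sketch of rendezvous (highest-random-weight) hashing, which shares the key property that adding or removing a shard only moves the keys that map to that shard. This is illustrative and not the uhashring API.

    python
    import hashlib
    from typing import List

    def _weight(shard: str, key: str) -> int:
        # Deterministic pseudo-random weight for the (shard, key) pair.
        return int(hashlib.sha256(f"{shard}:{key}".encode("utf-8")).hexdigest(), 16)

    def pick_shard(key: str, shards: List[str]) -> str:
        """Rendezvous hashing: a key always lands on the shard with the highest weight."""
        return max(shards, key=lambda shard: _weight(shard, key))

    shards = [f"vector-shard-{i}" for i in range(4)]
    print(pick_shard("tenant-A", shards))

    # Adding a fifth shard only reassigns the keys for which the new shard now
    # has the highest weight; every other tenant stays exactly where it was.
    shards.append("vector-shard-4")
    print(pick_shard("tenant-A", shards))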

  • Monitoring and Alerting: A sharded system is harder to monitor than a monolith.
  • Solution: Your observability platform must track metrics on a per-shard basis. Critical metrics to dashboard and alert on include (an instrumentation sketch follows the metric list):

    * query_latency_p95{shard="shard-1"}

    * vector_count{shard="shard-1"}

    * ingestion_rate_per_second{shard="shard-1"}

    * cpu_utilization{shard="shard-1"}

    * memory_usage{shard="shard-1"}

    An alert on high latency for a specific shard is a direct indicator of a hotspot or hardware issue.
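
    A minimal instrumentation sketch for the query router using the prometheus_client library; the metric names here are assumptions and should be aligned with your existing observability conventions.

    python
    from prometheus_client import Counter, Histogram

    # One time series per shard, keyed by the "shard" label.
    QUERY_LATENCY = Histogram(
        "vector_query_latency_seconds",
        "Latency of vector search calls, per shard",
        labelnames=["shard"],
    )
    QUERY_ERRORS = Counter(
        "vector_query_errors_total",
        "Failed vector search calls, per shard",
        labelnames=["shard"],
    )

    def timed_search(client, shard_id, vector, top_k, filters):
        """Wraps a shard search call with per-shard latency and error metrics."""
        with QUERY_LATENCY.labels(shard=shard_id).time():
            try:
                return client.search(vector=vector, top_k=top_k, filters=filters)
            except Exception:
                QUERY_ERRORS.labels(shard=shard_id).inc()
                raise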

  • Schema and Model Updates: How do you roll out a new embedding model across a live, sharded cluster?
  • Solution: Treat it as a blue-green deployment at the infrastructure level. Set up a parallel "green" cluster of shards. Backfill it with data processed by the new model. Once fully indexed and validated, update the query router to point to the new cluster. This ensures zero-downtime model migrations (a router-side sketch follows).
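
    One way the router-side cutover can look, as a minimal sketch: the active cluster is chosen by a single piece of configuration (a hypothetical ACTIVE_CLUSTER environment variable and endpoint map), so the blue-to-green flip is a configuration change rather than a code deployment.

    python
    import os
    from typing import Dict

    # Hypothetical: two parallel shard clusters, each with its own endpoints.
    SHARD_CLUSTERS: Dict[str, Dict[str, str]] = {
        "blue":  {f"vector-shard-{i}": f"http://blue-shard-{i}:8080" for i in range(4)},
        "green": {f"vector-shard-{i}": f"http://green-shard-{i}:8080" for i in range(4)},
    }

    def get_active_shard_endpoints() -> Dict[str, str]:
        """Returns the shard endpoints for the currently active cluster.

        After the green cluster is backfilled with the new embedding model and
        validated, flipping ACTIVE_CLUSTER from 'blue' to 'green' cuts query
        traffic over without downtime.
        """
        active = os.environ.get("ACTIVE_CLUSTER", "blue")
        return SHARD_CLUSTERS[active]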


    A Concrete Example: Multi-Tenancy in Weaviate

    Let's move from abstract clients to a concrete implementation. Modern vector databases like Weaviate have first-class support for multi-tenancy, which is a managed form of metadata-based sharding.

    With Weaviate, you can create a single collection (e.g., Documents) and enable the multi-tenancy feature. Each tenant_id you add creates a separate, isolated HNSW index within that collection. Weaviate's coordinator handles the routing internally, simplifying the application logic.

    Here is a docker-compose.yml to set up Weaviate:

    yaml
    version: '3.4'
    services:
      weaviate:
        image: cr.weaviate.io/semitechnologies/weaviate:1.24.1
        ports:
          - "8080:8080"
        restart: on-failure:0
        environment:
          QUERY_DEFAULTS_LIMIT: 25
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
          PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
          DEFAULT_VECTORIZER_MODULE: 'none' # We will provide our own vectors
          ENABLE_MODULES: ''
          CLUSTER_HOSTNAME: 'node1'

    And here's how the client-side code in your router and ingestion service would interact with it, using the v4 Weaviate Python client:

    python
    import weaviate
    import weaviate.classes as wvc
    from weaviate.classes.tenants import Tenant
    from weaviate.util import generate_uuid5
    from typing import List
    
    # The v4 client talks to the single Weaviate endpoint over HTTP (8080) and gRPC (50051).
    client = weaviate.connect_to_local()
    
    # --- Ingestion Side ---
    def provision_tenants_and_collection():
        # Delete the collection if it exists, for idempotency in testing
        if client.collections.exists("Document"):
            client.collections.delete("Document")
    
        client.collections.create(
            name="Document",
            vectorizer_config=wvc.config.Configure.Vectorizer.none(),  # we provide our own vectors
            multi_tenancy_config=wvc.config.Configure.multi_tenancy(enabled=True),
            properties=[
                wvc.config.Property(name="doc_id", data_type=wvc.config.DataType.TEXT),
            ],
        )
    
        # Create tenants, which provisions their isolated indexes
        documents = client.collections.get("Document")
        documents.tenants.create([Tenant(name="tenant-A"), Tenant(name="tenant-B")])
    
    def ingest_for_tenant(tenant_id: str, doc_id: str, vector: List[float]):
        # Get a handle scoped to this tenant's isolated data
        tenant_collection = client.collections.get("Document").with_tenant(tenant_id)
    
        tenant_collection.data.insert(
            properties={"doc_id": doc_id},
            uuid=generate_uuid5(doc_id),  # deterministic UUID derived from the doc id
            vector=vector
        )
        print(f"Ingested doc {doc_id} for tenant {tenant_id}")
    
    # --- Query Router Side ---
    def query_for_tenant(tenant_id: str, vector: List[float], top_k: int):
        # The router simply selects the correct tenant context before querying
        tenant_collection = client.collections.get("Document").with_tenant(tenant_id)
    
        response = tenant_collection.query.near_vector(
            near_vector=vector,
            limit=top_k
        )
    
        return [
            {"id": str(item.uuid), "properties": item.properties}
            for item in response.objects
        ]
    
    # --- Main execution logic ---
    if __name__ == "__main__":
        provision_tenants_and_collection()
        ingest_for_tenant("tenant-A", "doc-abc-123", [0.1] * 128)
        ingest_for_tenant("tenant-B", "doc-def-456", [0.9] * 128)
    
        # Query for tenant-A, will ONLY search tenant-A's data
        results_A = query_for_tenant("tenant-A", [0.11] * 128, 5)
        print("\nResults for Tenant A:", results_A)
        # This search will find doc-abc-123
    
        # Query for tenant-A with a vector closer to tenant-B's data
        results_A_miss = query_for_tenant("tenant-A", [0.89] * 128, 5)
        print("\nResults for Tenant A (expect miss):", results_A_miss)
        # This search will NOT return doc-def-456, proving data isolation
    
        client.close()

    While managed multi-tenancy is simpler, understanding the underlying principles of sharding, routing, and aggregation is crucial for debugging performance issues, planning capacity, and building custom solutions when managed features fall short.

    Conclusion: From Prototype to Production

    Scaling RAG systems is a journey from a single-node prototype to a distributed, resilient, and highly-performant search architecture. Monolithic vector indexes are a dead end. The path to supporting billions of documents with sub-100ms latency is paved with horizontal sharding.

    By implementing metadata-based partitioning, a smart query router, and robust strategies for result aggregation like RRF, engineering teams can build LLM-powered features that are not only powerful but also scalable and cost-effective. The architectural patterns discussed here—intelligent ingestion, single-shard lookups, and scatter-gather with fusion—represent the blueprint for production-grade vector search and are fundamental to the next generation of AI-native applications.
