Vector DB Sharding for High-Cardinality Metadata Filtering at Scale
The High-Cardinality Bottleneck in Vector Search
In any production-grade vector search application—be it Retrieval-Augmented Generation (RAG), semantic search, or recommendation engines—the vector itself is only half the story. The other half is the associated metadata. A simple query isn't just "find vectors similar to X"; it's "find vectors similar to X where tenant_id is 'acme-corp' and is_public is false".
While modern vector databases have evolved to handle metadata filtering, their performance characteristics often conceal a critical flaw that only becomes apparent at scale. The standard inverted indexes they build for metadata work exceptionally well for low-cardinality fields (is_public, document_type, status). However, when you introduce a high-cardinality field—an attribute with millions or tens of millions of unique values, like a user_id, product_id, or session_id—these indexes can become a significant performance bottleneck, leading to unacceptable query latencies.
This article bypasses the introductory "pre-filtering vs. post-filtering" discussion. We assume you've already identified that post-filtering is a non-starter at scale, as it retrieves irrelevant vectors and filters them in the application layer, destroying both performance and recall. Our focus is on advanced, production-ready architectural patterns for implementing efficient pre-filtering against high-cardinality metadata. We will dissect the baseline approach and three advanced strategies, complete with implementation details, performance analysis, and the operational trade-offs inherent to each.
The Baseline: Standard Metadata Indexing and Its Limits
Before diving into sharding, let's establish a clear baseline. Most managed vector databases (e.g., Pinecone, Weaviate, Zilliz Cloud) and self-hosted solutions (e.g., Qdrant, Milvus) create inverted indexes on payload/metadata fields. When a query with a filter arrives, the query planner first uses this index to identify a candidate set of vector IDs that match the metadata criteria. It then performs the Approximate Nearest Neighbor (ANN) search only on this subset.
Scenario: A document search system with 100 million vectors. Metadata includes document_type (10 unique values) and user_id (5 million unique values).
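As a point of reference for this scenario, here is how those payload (inverted) indexes are typically declared in Qdrant. This is a minimal sketch, assuming the large_doc_collection collection used in the baseline example below already exists.
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# Keyword index for the low-cardinality field...
client.create_payload_index(
    collection_name="large_doc_collection",
    field_name="document_type",
    field_schema=models.PayloadSchemaType.KEYWORD,
)
# ...and an integer index for the high-cardinality field. This is the index
# whose memory footprint and traversal cost grow with the number of unique user_ids.
client.create_payload_index(
    collection_name="large_doc_collection",
    field_name="user_id",
    field_schema=models.PayloadSchemaType.INTEGER,
)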
A query for document_type: 'pdf' is efficient. The inverted index's posting list for 'pdf' is consulted, yielding a set of matching vector IDs. The search space is effectively pruned.
Now, consider a query for user_id: 1234567. The posting list for this specific user_id might be small, but the index itself is enormous. The B-tree or hash map structure used for the inverted index must manage 5 million unique keys. As cardinality grows, so does the index's memory footprint and the tree traversal depth, adding a non-trivial latency overhead before the ANN search even begins.
The Breaking Point: The problem is compounded when the query involves a high-cardinality field and the ANN search must still traverse a large portion of the filtered set. If user_id: 1234567 has 50,000 documents, the database must still perform an ANN search over those 50,000 vectors. While better than searching 100 million, it's far from ideal.
Code Example: Baseline Filtering (Qdrant)
Here's how a standard pre-filtering query looks. This is our starting point, which we will systematically improve.
import numpy as np
from qdrant_client import QdrantClient, models
# Connect to a locally running Qdrant instance
client = QdrantClient(host="localhost", port=6333)
COLLECTION_NAME = "large_doc_collection"
# Example query vector
query_vector = np.random.rand(768).tolist()
# A typical query that struggles with high-cardinality user_id
def search_with_high_cardinality_filter(user_id: int, query_vec: list):
    print(f"Searching for user_id: {user_id}")
    # The filter that becomes slow at scale
    search_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="user_id",
                match=models.MatchValue(value=user_id),
            ),
            models.FieldCondition(
                key="document_type",
                match=models.MatchValue(value="invoice"),
            )
        ]
    )
    search_result = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vec,
        query_filter=search_filter,
        limit=10,
        with_payload=True
    )
    return search_result
# In a production scenario with millions of users, this query's latency
# would be dominated by the time taken to resolve the 'user_id' index.
# Let's assume user_id 500000 has many documents.
results = search_with_high_cardinality_filter(user_id=500000, query_vec=query_vector)
print(results)
# Hypothetical Benchmark:
# - 100M vectors, 5M unique users
# - P99 Latency: ~650ms
# The goal of the following strategies is to drastically reduce this latency.
This baseline is unacceptable for user-facing applications. We need to physically partition the data to avoid scanning the massive metadata index altogether.
Strategy 1: Physical Sharding by Tenant/User Collection
This is the most direct and effective strategy for multi-tenant applications where data is strictly isolated. Instead of a single, monolithic collection, you create a separate collection for each high-cardinality entity, typically the tenant_id or user_id.
The query logic no longer relies on a metadata filter for tenant isolation. Instead, the application layer directs the query to the correct, pre-isolated collection.
Architecture:
* The application layer extracts the high-cardinality key, typically the tenant_id (e.g., from a JWT).
* Each tenant's data lives in its own collection, named after the tenant (e.g., vectors_{tenant_id}).
* Queries are routed directly to that collection, so no tenant filter is needed at query time.
Implementation:
This pattern moves the complexity from the database to your application logic. You need a robust mechanism to manage the lifecycle of thousands or millions of collections.
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import Distance, VectorParams, UpdateStatus
import numpy as np
# This service acts as a router and abstraction over tenant collections
class MultiTenantVectorService:
    def __init__(self, db_client: QdrantClient):
        self.client = db_client
        self.vector_size = 768
        self.distance_metric = Distance.COSINE

    def _get_collection_name(self, tenant_id: str) -> str:
        # Sanitize tenant_id to be a valid collection name
        return f"tenant_{tenant_id.replace('-', '_')}"

    def ensure_collection_exists(self, tenant_id: str):
        collection_name = self._get_collection_name(tenant_id)
        try:
            self.client.get_collection(collection_name=collection_name)
        except Exception:  # Simplified error handling; check for a specific 404
            print(f"Collection {collection_name} not found, creating it.")
            self.client.recreate_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(size=self.vector_size, distance=self.distance_metric),
                # Add HNSW config, quantization, etc. for production
            )

    def upsert(self, tenant_id: str, points: list[models.PointStruct]):
        self.ensure_collection_exists(tenant_id)
        collection_name = self._get_collection_name(tenant_id)
        self.client.upsert(collection_name=collection_name, points=points, wait=True)

    def search(self, tenant_id: str, query_vector: list, filter_payload: dict = None, limit: int = 10):
        collection_name = self._get_collection_name(tenant_id)
        # The high-cardinality tenant_id filter is GONE.
        # The filtering now only happens on low-cardinality fields within the tenant's data.
        # Pydantic parses the nested dict into a Filter model.
        search_filter = models.Filter(**filter_payload) if filter_payload else None
        return self.client.search(
            collection_name=collection_name,
            query_vector=query_vector,
            query_filter=search_filter,
            limit=limit
        )
# --- Usage Example ---
client = QdrantClient(host="localhost", port=6333)
service = MultiTenantVectorService(client)
tenant_1 = "acme-corp-123"
tenant_2 = "globex-inc-456"
# Upsert data for tenant 1
service.upsert(tenant_1, [
    models.PointStruct(id=1, vector=np.random.rand(768).tolist(), payload={"doc_type": "pdf"}),
    models.PointStruct(id=2, vector=np.random.rand(768).tolist(), payload={"doc_type": "docx"})
])
# Upsert data for tenant 2
service.upsert(tenant_2, [
    models.PointStruct(id=1, vector=np.random.rand(768).tolist(), payload={"doc_type": "pdf"})
])
# Perform a search for tenant 1. Notice the query is simple and fast.
query_vec = np.random.rand(768).tolist()
results = service.search(
    tenant_id=tenant_1,
    query_vector=query_vec,
    filter_payload={
        "must": [
            {"key": "doc_type", "match": {"value": "pdf"}}
        ]
    }
)
print(f"Search results for {tenant_1}:\n{results}")
# Hypothetical Benchmark:
# - 100M vectors, 5M unique users (now 5M collections)
# - P99 Latency: ~45ms
# A >10x improvement. The search space is dramatically pruned before the query.
Pros:
* Maximum Performance: This is the fastest possible approach for tenant-isolated queries. The search space is minimal.
* Strong Data Isolation: Zero risk of data leakage between tenants at the database level.
* Solves Noisy Neighbor Problem: A heavy workload from one tenant won't impact the search performance of another (though they may still compete for CPU/memory on the node).
Cons & Edge Cases:
* Operational Overhead: Managing millions of collections can be a nightmare. Your tooling must be excellent. Some vector databases have limits on the number of collections.
* Resource Inefficiency: Each collection has a resource overhead (memory for indexes, metadata). Many small collections can be less efficient than one large one.
* Cross-Tenant Queries: Becomes extremely difficult or impossible. If you ever need to search across all public documents from all tenants, this architecture is prohibitive (see the fan-out sketch after this list).
* "Super" Tenants: What if one tenant has 50 million vectors while others have 500? This pattern doesn't solve intra-tenant scaling issues. The acme-corp collection could still be slow to query.
Strategy 2: Logical Sharding with Partition Keys
Some vector databases, notably Milvus, offer a built-in feature called partitions. A partition is a logical division of data within a single collection. You can direct searches to scan only one or more specific partitions, achieving similar search-space-pruning benefits to the multi-collection strategy but with lower operational overhead.
Architecture:
* Choose a partition key field (e.g., user_id or category_id) and create a partition per key value. This field should have high cardinality relative to your low-cardinality flags, but stay within the database's partition limit.
* At query time, pass the partition_names parameter, telling the database which partitions to scan.
Implementation (Milvus):
This example uses the pymilvus client to demonstrate creating and searching within partitions.
from pymilvus import (
connections, utility, Partition, Collection, FieldSchema, CollectionSchema, DataType
)
import numpy as np
# --- Setup ---
connections.connect("default", host="localhost", port="19530")
COLLECTION_NAME = "partitioned_product_catalog"
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Define schema. The partition key field does not need special schema treatment.
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="category_id", dtype=DataType.INT64),  # Our partition key
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields, "Product catalog with partitions")
collection = Collection(COLLECTION_NAME, schema)
# Create the vector index
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index("embedding", index_params)
# --- Partition Management ---
# Create partitions for two categories
category_101 = Partition(collection, name="cat_101")
category_205 = Partition(collection, name="cat_205")
# Insert data directly into partitions.
# Note: the primary key uses auto_id=True, so we supply only the
# category_id and embedding columns.
num_entities_per_cat = 5000
data_cat_101 = [
    [101] * num_entities_per_cat,  # category_id
    np.random.rand(num_entities_per_cat, 128).tolist()
]
data_cat_205 = [
    [205] * num_entities_per_cat,  # category_id
    np.random.rand(num_entities_per_cat, 128).tolist()
]
category_101.insert(data_cat_101)
category_205.insert(data_cat_205)
collection.flush()
collection.load()
# --- Targeted Search ---
query_vector = [np.random.rand(128).tolist()]
# Search only within a specific category (partition)
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
print("\nSearching in category 101...")
results_101 = collection.search(
    data=query_vector,
    anns_field="embedding",
    param=search_params,
    limit=5,
    partition_names=["cat_101"]  # CRITICAL: This prunes the search space
)
print(results_101)
# You can also search across a few partitions
print("\nSearching in categories 101 and 205...")
results_multi = collection.search(
    data=query_vector,
    anns_field="embedding",
    param=search_params,
    limit=5,
    partition_names=["cat_101", "cat_205"]
)
print(results_multi)
# Hypothetical Benchmark:
# - 100M vectors, 10k categories (partitions)
# - P99 Latency: ~70ms
# Performance is comparable to multi-collection, but with better manageability.
Pros:
* Good Performance: Offers search space pruning similar to the multi-collection approach.
* Lower Operational Overhead: Managing partitions within one collection is far simpler than managing millions of collections. Database connection management is also simpler.
* Flexible Queries: Allows for efficient searching across a small, specified set of partitions, which is difficult with the multi-collection model.
Cons & Edge Cases:
* Not Universally Available: This is a specific feature of databases like Milvus and Zilliz. It's not a standard pattern you can apply everywhere.
* Partition Key Immutability: The partition key for a data point is typically fixed upon insertion.
* Management Still Required: You still need application logic to create/manage partitions as new categories or users appear, though it's less burdensome (a minimal sketch follows this list).
* Limits: Databases may have limits on the number of partitions per collection (e.g., Milvus limit is 4096). This makes it unsuitable for user_id sharding with millions of users, but excellent for category_id or region_id with thousands of values.
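The sketch below illustrates that lifecycle logic under the assumptions of the Milvus example above: a hypothetical ensure_partition helper that lazily creates a partition (via the standard has_partition/create_partition Collection methods) the first time a new category appears, so writes and searches can always reference it safely.
def ensure_partition(collection: Collection, category_id: int) -> str:
    # Lazily create the partition the first time this category is seen.
    partition_name = f"cat_{category_id}"
    if not collection.has_partition(partition_name):
        collection.create_partition(partition_name)
    return partition_name

# Route writes through the helper so the partition exists before it is used:
# name = ensure_partition(collection, 307)
# collection.insert([[307] * len(embeddings), embeddings], partition_name=name)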
Strategy 3: Two-Stage Hybrid Search (Vector DB + Relational DB)
This is the most flexible and powerful pattern for handling extremely complex, multi-attribute, high-cardinality filtering. It acknowledges a fundamental truth: relational databases are still superior for complex metadata filtering. This pattern leverages the strengths of both systems.
Architecture:
* Stage 1 (Metadata Filtering): The initial request hits the orchestrator, which first queries the relational database. This query applies all the complex, high-cardinality filters (user_id, created_at > '2023-01-01', tag IN ('a', 'b', 'c'), etc.). The result of this query is a small list of vector IDs that match the criteria.
* Stage 2 (Vector Search): The orchestrator then passes this explicit list of IDs to the vector database and instructs it to perform the ANN search only on those specific vectors.
Implementation:
This pattern requires careful orchestration and error handling between the two database calls.
import psycopg2
from qdrant_client import QdrantClient, models
import numpy as np
# Assume clients for both databases are configured
# pg_conn = psycopg2.connect(database="mydb", user="user", password="pass", host="localhost", port="5432")
# qdrant_client = QdrantClient(host="localhost", port=6333)
COLLECTION_NAME = "hybrid_search_vectors"
class HybridSearchService:
    def __init__(self, pg_conn, qdrant_client):
        self.pg_conn = pg_conn
        self.qdrant = qdrant_client

    def search(self, user_id: int, tags: list[str], query_vector: list, limit: int = 10):
        # --- Stage 1: Filter in PostgreSQL ---
        print("Stage 1: Querying PostgreSQL for candidate vector IDs...")
        candidate_ids = []
        with self.pg_conn.cursor() as cursor:
            # The SQL query can be arbitrarily complex and is highly optimized by Postgres
            sql_query = "SELECT vector_id FROM documents WHERE user_id = %s AND tags && %s LIMIT 10000;"
            cursor.execute(sql_query, (user_id, tags))
            results = cursor.fetchall()
            candidate_ids = [row[0] for row in results]

        if not candidate_ids:
            print("No documents matched metadata filters.")
            return []
        print(f"Found {len(candidate_ids)} candidates. Proceeding to Stage 2.")

        # --- Stage 2: Search in Qdrant ---
        # We use an ID condition to restrict the search to our candidate set.
        # This is extremely fast as it's a direct ID lookup, not an index scan.
        search_filter = models.Filter(
            must=[
                models.HasIdCondition(has_id=candidate_ids)
            ]
        )
        search_result = self.qdrant.search(
            collection_name=COLLECTION_NAME,
            query_vector=query_vector,
            query_filter=search_filter,
            limit=limit,
            # We don't need the payload from Qdrant; we can fetch it from Postgres if needed
            with_payload=False
        )
        return search_result
# --- Usage Example ---
# Assume pg_conn and qdrant_client are initialized
# service = HybridSearchService(pg_conn, qdrant_client)
# query_vec = np.random.rand(768).tolist()
# # This query involves two high-cardinality dimensions (user_id, tags)
# results = service.search(
# user_id=987654,
# tags=['project_alpha', 'internal_only'],
# query_vector=query_vec
# )
# print(results)
# Hypothetical Benchmark:
# - 100M vectors, 5M users, 1M tags
# - Postgres query (Stage 1): ~15ms (with proper indexing)
# - Qdrant query (Stage 2): ~25ms
# - Total P99 Latency: ~40ms + network overhead
# This is often the highest performing and most flexible solution.
Pros:
* Ultimate Flexibility: Supports any metadata query complexity that a relational database can handle.
* Leverages Best-of-Breed: Uses the right tool for the job. Postgres for structured data querying, Vector DB for ANN search.
* Excellent Performance: If the relational query is fast (and it should be with proper indexing), the overall latency is extremely low, as the vector search space is pruned to the exact required set.
* Simplified Vector DB: The vector database becomes a simpler key-value store (ID -> Vector), reducing its operational complexity.
Cons & Edge Cases:
* Architectural Complexity: You now manage and maintain two separate database systems. Data consistency between them must be handled (e.g., using a message queue or Change Data Capture); a minimal dual-write sketch follows this list.
* Network Latency: There is an extra network hop between the orchestrator and the two databases.
* Candidate Set Size: What if the Stage 1 query returns 1 million candidate IDs? Passing a massive list of IDs to the vector database can be inefficient or hit query size limits. You must cap the number of candidate IDs (e.g., LIMIT 10000 in the SQL query), which introduces a trade-off: you might prematurely exclude the true nearest neighbors if they aren't in the first 10,000 metadata matches.
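To illustrate the consistency burden, here is a minimal dual-write sketch under the assumptions of the example above (the hypothetical documents(vector_id, user_id, tags) table and the hybrid_search_vectors collection). A real deployment would typically replace this with an outbox table or CDC pipeline, as noted in the first point, plus a reconciliation job for orphaned vectors.
def index_document(pg_conn, qdrant_client, vector_id: int, user_id: int,
                   tags: list[str], vector: list):
    # 1. Write the vector first. If the metadata commit below fails, the vector is
    #    orphaned but unreachable (Stage 1 never returns its ID), so a periodic
    #    cleanup job can remove it later.
    qdrant_client.upsert(
        collection_name=COLLECTION_NAME,
        points=[models.PointStruct(id=vector_id, vector=vector, payload={})],
        wait=True,
    )
    # 2. Commit the metadata. Only after this commit does the document become
    #    discoverable through the two-stage search path.
    with pg_conn.cursor() as cursor:
        cursor.execute(
            "INSERT INTO documents (vector_id, user_id, tags) VALUES (%s, %s, %s);",
            (vector_id, user_id, tags),
        )
    pg_conn.commit()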
Decision Matrix: Choosing Your Strategy
There is no one-size-fits-all solution. The optimal strategy depends entirely on your application's specific access patterns, isolation requirements, and operational capacity.
| Strategy | Primary Use Case | Performance | Isolation | Operational Complexity | Flexibility |
|---|---|---|---|---|---|
| Baseline Metadata Indexing | Low-to-medium cardinality fields | Low | None | Low | High |
| Sharding by Tenant Collection | Strict multi-tenant isolation | Very High | Strong | Very High | Low |
| Partition Keys (e.g., Milvus) | Grouping by medium-cardinality keys (e.g., category) | High | Logical | Medium | Medium |
| Two-Stage Hybrid Search | Complex, multi-attribute, high-cardinality filters | Very High | N/A | High | Very High |
A Senior Engineer's Thought Process:
* Is every query scoped to a single high-cardinality entity (e.g., one tenant_id), or many?
* Do users always search within their own tenant_id? Or do they search within a category_id across all users? If the latter, Strategy 2 (Partition Keys) is a better fit (if your DB supports it).
By systematically evaluating these trade-offs, you can design a vector search system that not only meets today's requirements but also scales gracefully as your data and its complexity grow.