Serverless Hybrid Search with Reciprocal Rank Fusion on AWS
Beyond Naive RAG: The Case for Production-Grade Hybrid Search
In the current AI landscape, many retrieval-augmented generation (RAG) systems rely solely on vector similarity search. While powerful for capturing semantic meaning, this approach has a critical flaw in production: it often fails on queries requiring exact lexical matches, such as product SKUs, specific error codes, legal clauses, or proper nouns. A query for `Error 0x80070005` will likely yield semantically related but incorrect results in a pure vector system. Conversely, traditional keyword search (like TF-IDF or BM25) excels at these exact matches but fails to understand nuance, synonyms, or user intent.
Production-grade search demands the best of both worlds. The solution is hybrid search: a technique that combines results from both lexical and semantic search engines. However, the real challenge lies not in running two separate queries, but in intelligently fusing their results into a single, cohesive, and relevance-ranked list.
Simple approaches like weighted score combination are notoriously difficult to tune and maintain. The scores from a BM25 query and a k-NN vector search are on different scales and have different distributions, making any linear combination fragile. This is where Reciprocal Rank Fusion (RRF) provides a more robust, elegant, and production-ready solution.
RRF is a rank-based fusion method, meaning it disregards the raw, incomparable scores from different systems and focuses solely on the position of a document in each result list. This makes it inherently resilient to score scaling issues. In this post, we will architect and implement a high-performance, cost-effective serverless hybrid search system on AWS using Amazon OpenSearch Serverless for querying and AWS Lambda for orchestration, all centered around a sophisticated RRF implementation.
The Core Architecture: A Serverless Blueprint
Our target architecture is designed for scalability, maintainability, and pay-per-use cost efficiency. It avoids managing dedicated servers by leveraging a fully serverless stack.
Components:
- Amazon API Gateway: the HTTPS entry point for search requests.
- AWS Lambda (Search Orchestrator): generates the query embedding, runs both searches, and performs the fusion.
- Amazon OpenSearch Serverless: a single collection and index serving both BM25 (lexical) and k-NN (vector) queries.
- An embedding service (here, a SageMaker endpoint) that converts the query text into a vector.
Here is a high-level diagram of the query flow:
graph TD
A[Client] -->|HTTPS Request| B(API Gateway)
B -->|Proxy Integration| C{Search Orchestrator Lambda}
C -->|1. Generate Query Embedding| D[Embedding Service]
D -->|Embedding Vector| C
subgraph Parallel Queries
C -->|2a. BM25 Query| E[OpenSearch Serverless]
C -->|2b. k-NN Query| E
end
E -->|BM25 Results| C
E -->|k-NN Results| C
C -->|3. Perform RRF| F[RRF Logic]
F -->|Fused & Ranked IDs| C
C -->|4. Fetch Documents (Optional)| E
C -->|5. Return Fused Results| B
B -->|JSON Response| A
Deep Dive: Reciprocal Rank Fusion (RRF)
RRF's elegance lies in its simplicity and effectiveness. The formula for the RRF score of a document `d` is:
RRF_Score(d) = Σ_i (1 / (k + rank_i(d)))
Where:
- `rank_i(d)` is the rank of document `d` in result set `i` (e.g., the BM25 result set or the k-NN result set).
- The sum is over all result sets.
- `k` is a constant used to mitigate the impact of documents ranked very highly in a single result set dominating the final score. A common value for `k` is 60, as suggested in the original paper, but it is a tunable hyperparameter.

Why is this better than score-based fusion?
Consider a BM25 search returning a document with a score of 25.7 and a vector search returning the same document with a score of 0.91 (cosine similarity). How do you combine these? You could try to normalize them to a [0, 1] range, but the distributions are different. A BM25 score of 25.7 might be exceptional, while a cosine similarity of 0.91 might be mediocre for your dataset. RRF sidesteps this entirely. If the document is rank 1 in both searches, its contribution is (1 / (k + 1)) + (1 / (k + 1)), regardless of the underlying scores.
Let's implement this in Python.
# rrf_fusion.py
from collections import defaultdict
def reciprocal_rank_fusion(search_results: list[list[str]], k: int = 60) -> dict[str, float]:
"""
Performs Reciprocal Rank Fusion on multiple lists of search results.
Args:
search_results: A list of lists, where each inner list contains document IDs
ordered by relevance from a single search system.
k: A constant to tune the fusion algorithm. Defaults to 60.
Returns:
A dictionary mapping document IDs to their fused RRF scores.
"""
fused_scores = defaultdict(float)
# Each result_list is one of our searches (e.g., BM25, k-NN)
for result_list in search_results:
# Iterate through the ranked documents in the current list
for rank, doc_id in enumerate(result_list, 1):
# The core RRF formula
fused_scores[doc_id] += 1 / (k + rank)
return fused_scores
# Example Usage:
# BM25 results (lexical search)
bm25_results = ["doc_3", "doc_1", "doc_5"]
# k-NN results (semantic search)
knn_results = ["doc_1", "doc_4", "doc_3", "doc_2"]
all_results = [bm25_results, knn_results]
fused = reciprocal_rank_fusion(all_results)
# Sort the results by score in descending order
sorted_fused = sorted(fused.items(), key=lambda item: item[1], reverse=True)
print("BM25 Results:", bm25_results)
print("k-NN Results:", knn_results)
print("\nFused and Sorted Results:")
for doc_id, score in sorted_fused:
print(f" {doc_id}: {score:.4f}")
# --- Expected Output ---
# BM25 Results: ['doc_3', 'doc_1', 'doc_5']
# k-NN Results: ['doc_1', 'doc_4', 'doc_3', 'doc_2']
#
# Fused and Sorted Results:
# doc_1: 0.0325 # rank 2 in BM25, rank 1 in k-NN
# doc_3: 0.0323 # rank 1 in BM25, rank 3 in k-NN
# doc_4: 0.0161 # only in k-NN at rank 2
# doc_5: 0.0159 # only in BM25 at rank 3
# doc_2: 0.0156 # only in k-NN at rank 4
Notice how `doc_1` and `doc_3`, which appear in both lists, bubble to the top. `doc_1` wins because its ranks (2 and 1) are better on average than `doc_3`'s (1 and 3).
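Concretely, with `k = 60`:
doc_1: 1/(60+2) + 1/(60+1) ≈ 0.0161 + 0.0164 = 0.0325
doc_3: 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 = 0.0323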
Production Implementation on AWS
1. Setting up OpenSearch Serverless
First, create an OpenSearch Serverless "collection". For this use case, we need an index that supports both text search and vector search.
Index mapping definition (`index_mapping.json`):
{
"settings": {
"index.knn": true,
"index.knn.space_type": "cosinesimil"
},
"mappings": {
"properties": {
"doc_id": {
"type": "keyword"
},
"title": {
"type": "text"
},
"content": {
"type": "text"
},
"embedding": {
"type": "knn_vector",
"dimension": 768, // MUST match your embedding model's dimension
"method": {
"name": "hnsw",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 48
}
}
}
}
}
}
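For reference, here is a minimal sketch of creating the index with this mapping via `opensearch-py`, assuming an `os_client` configured for the collection endpoint (as set up in the Lambda code later) and the index name `hybrid-search-index` used throughout this post:

# create_index.py -- a sketch; assumes os_client is initialized as in search_lambda.py
import json

INDEX_NAME = 'hybrid-search-index'

def create_hybrid_index(os_client, mapping_path: str = 'index_mapping.json') -> None:
    """Creates the hybrid search index from the mapping file if it does not already exist."""
    with open(mapping_path) as f:
        index_body = json.load(f)  # the mapping file must be strict JSON
    if not os_client.indices.exists(index=INDEX_NAME):
        os_client.indices.create(index=INDEX_NAME, body=index_body)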
- `index.knn: true`: Enables the k-NN plugin for the index.
- `embedding.type: knn_vector`: Defines the field for storing vector embeddings.
- `dimension`: This is critical. It must exactly match the output dimension of your embedding model (e.g., 384 for `all-MiniLM-L6-v2`, 768 for `all-mpnet-base-v2`, 1536 for OpenAI `text-embedding-ada-002`).
- `method`: We use `hnsw` with the `faiss` engine, a standard for high-performance approximate nearest neighbor search.

2. The Search Orchestrator Lambda
This is the heart of our system. We'll use Python with `boto3` and the `opensearch-py` library. The key to performance is executing the BM25 and k-NN queries in parallel.
Performance Consideration: Connection Pooling
AWS Lambda freezes the execution environment between invocations. We can exploit this to reuse TCP connections and client objects, significantly reducing latency on warm invocations. We will initialize the OpenSearch client outside the main `lambda_handler` function.
Lambda Code (`search_lambda.py`):
import os
import json
import asyncio
from collections import defaultdict
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
# --- Environment Variables ---
# OPENSEARCH_HOST: The OpenSearch Serverless collection endpoint (e.g., *.aoss.us-east-1.amazonaws.com)
# EMBEDDING_ENDPOINT_NAME: Name of the SageMaker endpoint for embeddings
# --- Initialization (outside handler for reuse) ---
SERVICE = 'aoss'
REGION = os.environ.get('AWS_REGION', 'us-east-1')
OPENSEARCH_HOST = os.environ['OPENSEARCH_HOST']
INDEX_NAME = 'hybrid-search-index'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
credentials.access_key,
credentials.secret_key,
REGION,
SERVICE,
session_token=credentials.token
)
sagemaker_runtime = boto3.client('sagemaker-runtime')
# Initialize OpenSearch client
os_client = OpenSearch(
hosts=[{'host': OPENSEARCH_HOST, 'port': 443}],
http_auth=awsauth,
use_ssl=True,
verify_certs=True,
connection_class=RequestsHttpConnection,
pool_maxsize=20 # Important for parallel queries
)
# --- Helper Functions ---
def get_embedding(text: str) -> list[float]:
"""Invokes a SageMaker endpoint to get the text embedding."""
endpoint_name = os.environ['EMBEDDING_ENDPOINT_NAME']
response = sagemaker_runtime.invoke_endpoint(
EndpointName=endpoint_name,
ContentType='application/json',
Body=json.dumps({'inputs': [text]})
)
result = json.loads(response['Body'].read().decode())
return result[0] # Assuming endpoint returns a list of embeddings
async def run_bm25_query(query_text: str, size: int) -> list[str]:
"""Runs a standard BM25 match query."""
query = {
'size': size,
'_source': False, # We only need the doc IDs for fusion
'query': {
'multi_match': {
'query': query_text,
'fields': ['title', 'content']
}
}
}
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: os_client.search(index=INDEX_NAME, body=query)
)
return [hit['_id'] for hit in response['hits']['hits']]
async def run_knn_query(query_vector: list[float], size: int) -> list[str]:
"""Runs a k-NN vector search query."""
query = {
'size': size,
'_source': False,
'query': {
'knn': {
'embedding': {
'vector': query_vector,
'k': size
}
}
}
}
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: os_client.search(index=INDEX_NAME, body=query)
)
return [hit['_id'] for hit in response['hits']['hits']]
def reciprocal_rank_fusion(search_results: list[list[str]], k: int = 60) -> list[str]:
"""Performs RRF and returns a sorted list of doc IDs."""
fused_scores = defaultdict(float)
for result_list in search_results:
for rank, doc_id in enumerate(result_list, 1):
fused_scores[doc_id] += 1 / (k + rank)
if not fused_scores:
return []
sorted_docs = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
return [doc_id for doc_id, _ in sorted_docs]
# --- Main Handler ---
async def main_logic(event):
"""Main async logic for search orchestration."""
    query_params = event.get('queryStringParameters') or {}  # API Gateway sends null when there is no query string
query_text = query_params.get('q')
if not query_text:
return {'statusCode': 400, 'body': json.dumps({'error': 'Query parameter "q" is required'})}
k_val = int(query_params.get('k_rrf', 60))
size = int(query_params.get('size', 10))
# 1. Get query embedding
try:
query_vector = get_embedding(query_text)
except Exception as e:
print(f"Error getting embedding: {e}")
return {'statusCode': 500, 'body': json.dumps({'error': 'Failed to generate query embedding'})}
# 2. Run queries in parallel
tasks = [
run_bm25_query(query_text, size=size * 2), # Fetch more results to give RRF more to work with
run_knn_query(query_vector, size=size * 2)
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Error handling for parallel tasks
bm25_results, knn_results = [], []
if isinstance(results[0], list):
bm25_results = results[0]
else:
print(f"BM25 query failed: {results[0]}")
if isinstance(results[1], list):
knn_results = results[1]
else:
print(f"k-NN query failed: {results[1]}")
# 3. Perform Reciprocal Rank Fusion
fused_ids = reciprocal_rank_fusion([bm25_results, knn_results], k=k_val)
# 4. Fetch full documents for the top N fused results (optional)
# For simplicity, we return IDs. In production, you'd fetch from DynamoDB or OpenSearch `_source`.
final_results = fused_ids[:size]
return {
'statusCode': 200,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({
'results': final_results,
'fusion_meta': {
'bm25_count': len(bm25_results),
'knn_count': len(knn_results),
'fused_count': len(final_results)
}
})
}
def lambda_handler(event, context):
return asyncio.run(main_logic(event))
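For illustration, the handler expects an API Gateway proxy-style event like the hypothetical one below; a local smoke test would still require the environment variables and reachable OpenSearch and SageMaker endpoints described above:

# Hypothetical local smoke test -- real invocations arrive via API Gateway.
if __name__ == '__main__':
    example_event = {
        'queryStringParameters': {
            'q': 'Error 0x80070005 access denied',  # exact-match heavy query
            'size': '5',
            'k_rrf': '60'
        }
    }
    print(lambda_handler(example_event, None))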
Key Implementation Details:
- `asyncio.gather` is used to run the lexical and semantic queries concurrently. This is crucial for minimizing latency, as the total query time becomes max(latency_bm25, latency_knn) instead of sum(latency_bm25, latency_knn).
- `size * 2` results are fetched from each search system. This provides more candidate documents for the fusion algorithm to re-rank, often leading to better results in the final top `size` list.
- Calling `asyncio.gather` with `return_exceptions=True` ensures that if one search system fails (e.g., the k-NN query times out), the other can still provide results. The RRF function naturally handles an empty list from one of the sources.
- `k`: The RRF `k` parameter is exposed via the query parameter `k_rrf`, allowing dynamic tuning and experimentation without redeploying the Lambda.

3. Deployment and IAM
Package the `opensearch-py` and `requests-aws4auth` dependencies in a Lambda layer or zip file. The Lambda's execution role needs:
- `aoss:APIAccessAll` on the OpenSearch Serverless collection.
- `sagemaker:InvokeEndpoint` on the embedding model's endpoint.
- The standard `AWSLambdaBasicExecutionRole` for CloudWatch Logs.
Advanced Edge Cases and Production Considerations
Tuning the RRF `k` Constant
The `k` parameter in RRF controls the steepness of the rank-to-score decay curve. A smaller `k` gives disproportionately more weight to top-ranked items, making the fusion more sensitive to the #1 and #2 spots. A larger `k` flattens the curve, giving more consideration to items ranked lower in the original lists.
- Small `k` (e.g., 2-10): Use when you have very high confidence in your search systems' top results. This can be useful for navigational queries where the top hit is almost always the correct one.
- Large `k` (e.g., 60-100): A safer, more general-purpose choice. It prevents a single system's potentially noisy top result from dominating the fusion. `k=60` is a well-regarded starting point.
To tune `k` effectively, you need an offline evaluation set with queries and labeled relevant documents. You can then iterate through different values of `k` and measure the impact on metrics like NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision), as sketched below.
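As a rough sketch of such a sweep, assuming an evaluation set of `(bm25_ids, knn_ids, relevant_ids)` tuples per query (names are illustrative) and reusing the list-returning `reciprocal_rank_fusion` from `search_lambda.py`, with a simplified binary-relevance NDCG:

import math

def ndcg_at_n(ranked_ids: list[str], relevant_ids: set[str], n: int = 10) -> float:
    """Binary-relevance NDCG@n: discounted gain of relevant hits vs. the ideal ordering."""
    dcg = sum(1 / math.log2(i + 2) for i, doc_id in enumerate(ranked_ids[:n]) if doc_id in relevant_ids)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant_ids), n)))
    return dcg / ideal if ideal else 0.0

def sweep_rrf_k(eval_set, k_values=(10, 20, 40, 60, 80, 100), n: int = 10) -> None:
    """Prints mean NDCG@n for each candidate RRF k over the labeled evaluation set."""
    for k in k_values:
        scores = [
            ndcg_at_n(reciprocal_rank_fusion([bm25_ids, knn_ids], k=k), set(relevant_ids), n)
            for bm25_ids, knn_ids, relevant_ids in eval_set
        ]
        print(f"k={k}: mean NDCG@{n} = {sum(scores) / len(scores):.4f}")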
Handling Cold Starts
AWS Lambda cold starts can introduce significant latency for the first request in a while. For a user-facing search API, this is often unacceptable. Common mitigations are enabling Provisioned Concurrency on the orchestrator function and keeping heavyweight initialization (clients, connections) outside the handler, as we did above, so warm invocations reuse them.
Multi-tenancy Strategy
For a SaaS application, you must enforce data isolation. A common and effective pattern with this architecture is to add a `tenant_id` field to your OpenSearch documents.
Modified Index Mapping:
{
"mappings": {
"properties": {
"tenant_id": { "type": "keyword" },
...
}
}
}
Then, every query must be wrapped in a `bool` query with a `filter` clause.
Modified k-NN Query for Multi-tenancy:
{
"size": 20,
"_source": false,
"query": {
"bool": {
"must": [
{
"knn": {
"embedding": {
"vector": [0.1, 0.2, ...],
"k": 20
}
}
}
],
"filter": [
{
"term": {
"tenant_id": "tenant-abc-123"
}
}
]
}
}
}
This filtering is highly efficient in OpenSearch. The `tenant_id` can be extracted from a JWT or another authentication mechanism in your API Gateway or Lambda authorizer.
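To wire this into the Lambda, the earlier `run_knn_query` helper could be extended along these lines (a sketch reusing `os_client` and `INDEX_NAME` from `search_lambda.py`; the `tenant_id` is assumed to arrive already validated from the authorizer):

async def run_knn_query_for_tenant(query_vector: list[float], size: int, tenant_id: str) -> list[str]:
    """k-NN query restricted to a single tenant via a bool filter on tenant_id."""
    query = {
        'size': size,
        '_source': False,
        'query': {
            'bool': {
                'must': [
                    {'knn': {'embedding': {'vector': query_vector, 'k': size}}}
                ],
                'filter': [
                    {'term': {'tenant_id': tenant_id}}
                ]
            }
        }
    }
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        lambda: os_client.search(index=INDEX_NAME, body=query)
    )
    return [hit['_id'] for hit in response['hits']['hits']]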
When One Search System Returns No Results
Our RRF implementation handles this gracefully. If, for example, the BM25 query finds no matches for a very abstract semantic query, `bm25_results` will be an empty list. The RRF function will simply iterate over it, add nothing to the scores, and proceed with the k-NN results. The final ranking will be based solely on the k-NN results, which is the desired behavior. This is a significant advantage over score-based fusion methods that might struggle with missing score values.
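For example, using the list-returning `reciprocal_rank_fusion` from `search_lambda.py`:

# BM25 found nothing for a purely semantic query; k-NN still returned hits.
print(reciprocal_rank_fusion([[], ["doc_7", "doc_2"]]))
# -> ['doc_7', 'doc_2'] (the ordering falls back to the k-NN ranking)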
Conclusion
By moving from a single-modality vector search to a serverless hybrid search architecture with Reciprocal Rank Fusion, we have built a system that is far more robust, relevant, and capable of handling the diverse queries seen in production environments. This pattern leverages the strengths of both lexical and semantic search while mitigating their individual weaknesses. The use of AWS Lambda for parallel orchestration ensures low latency, and the rank-based nature of RRF provides a stable, easily maintainable fusion strategy that avoids the pitfalls of complex score normalization.
This architecture is not a theoretical exercise; it is a production-ready blueprint for building next-generation search experiences. It provides the relevance-ranking backbone for sophisticated RAG systems, e-commerce search platforms, and enterprise knowledge bases, delivering a user experience that is simply unattainable with keyword or vector search alone.