Serverless Hybrid Search with Reciprocal Rank Fusion on AWS
Beyond Naive RAG: The Case for Production-Grade Hybrid Search
In the current AI landscape, many retrieval-augmented generation (RAG) systems rely solely on vector similarity search. While powerful for capturing semantic meaning, this approach has a critical flaw in production: it often fails on queries requiring exact lexical matches, such as product SKUs, specific error codes, legal clauses, or proper nouns. A query for `Error 0x80070005` will likely yield semantically related but incorrect results in a pure vector system. Conversely, traditional keyword search (like TF-IDF or BM25) excels at these exact matches but fails to understand nuance, synonyms, or user intent.
Production-grade search demands the best of both worlds. The solution is hybrid search: a technique that combines results from both lexical and semantic search engines. However, the real challenge lies not in running two separate queries, but in intelligently fusing their results into a single, cohesive, and relevance-ranked list.
Simple approaches like weighted score combination are notoriously difficult to tune and maintain. The scores from a BM25 query and a k-NN vector search are on different scales and have different distributions, making any linear combination fragile. This is where Reciprocal Rank Fusion (RRF) provides a more robust, elegant, and production-ready solution.
RRF is a rank-based fusion method, meaning it disregards the raw, incomparable scores from different systems and focuses solely on the position of a document in each result list. This makes it inherently resilient to score scaling issues. In this post, we will architect and implement a high-performance, cost-effective serverless hybrid search system on AWS using Amazon OpenSearch Serverless for querying and AWS Lambda for orchestration, all centered around a sophisticated RRF implementation.
The Core Architecture: A Serverless Blueprint
Our target architecture is designed for scalability, maintainability, and pay-per-use cost efficiency. It avoids managing dedicated servers by leveraging a fully serverless stack.
Components:
- Amazon API Gateway: the HTTPS entry point for search requests.
- AWS Lambda (Search Orchestrator): generates the query embedding, runs both searches, and performs the fusion.
- Amazon OpenSearch Serverless: a single collection and index serving both BM25 (lexical) and k-NN (vector) queries.
- An embedding service (here, a SageMaker endpoint) that converts the query text into a vector.
Here is a high-level diagram of the query flow:
graph TD
A[Client] -->|HTTPS Request| B(API Gateway)
B -->|Proxy Integration| C{Search Orchestrator Lambda}
C -->|1. Generate Query Embedding| D[Embedding Service]
D -->|Embedding Vector| C
subgraph Parallel Queries
C -->|2a. BM25 Query| E[OpenSearch Serverless]
C -->|2b. k-NN Query| E
end
E -->|BM25 Results| C
E -->|k-NN Results| C
C -->|3. Perform RRF| F[RRF Logic]
F -->|Fused & Ranked IDs| C
C -->|4. Fetch Documents (Optional)| E
C -->|5. Return Fused Results| B
B -->|JSON Response| A
Deep Dive: Reciprocal Rank Fusion (RRF)
RRF's elegance lies in its simplicity and effectiveness. The formula for the RRF score of a document `d` is:
RRF_Score(d) = Σ_i (1 / (k + rank_i(d)))
Where:
- `rank_i(d)` is the rank of document `d` in result set `i` (e.g., the BM25 result set or the k-NN result set).
- The sum is over all result sets.
- `k` is a constant used to mitigate the impact of documents ranked very highly in a single result set dominating the final score. A common value for `k` is 60, as suggested in the original paper, but it is a tunable hyperparameter.

Why is this better than score-based fusion?
Consider a BM25 search returning a document with a score of 25.7 and a vector search returning the same document with a score of 0.91 (cosine similarity). How do you combine these? You could try to normalize them to a [0, 1] range, but the distributions are different. A BM25 score of 25.7 might be exceptional, while a cosine similarity of 0.91 might be mediocre for your dataset. RRF sidesteps this entirely. If the document is rank 1 in both searches, its contribution is (1 / (k + 1)) + (1 / (k + 1)), regardless of the underlying scores.
Let's implement this in Python.
# rrf_fusion.py
from collections import defaultdict
def reciprocal_rank_fusion(search_results: list[list[str]], k: int = 60) -> dict[str, float]:
"""
Performs Reciprocal Rank Fusion on multiple lists of search results.
Args:
search_results: A list of lists, where each inner list contains document IDs
ordered by relevance from a single search system.
k: A constant to tune the fusion algorithm. Defaults to 60.
Returns:
A dictionary mapping document IDs to their fused RRF scores.
"""
fused_scores = defaultdict(float)
# Each result_list is one of our searches (e.g., BM25, k-NN)
for result_list in search_results:
# Iterate through the ranked documents in the current list
for rank, doc_id in enumerate(result_list, 1):
# The core RRF formula
fused_scores[doc_id] += 1 / (k + rank)
return fused_scores
# Example Usage:
# BM25 results (lexical search)
bm25_results = ["doc_3", "doc_1", "doc_5"]
# k-NN results (semantic search)
knn_results = ["doc_1", "doc_4", "doc_3", "doc_2"]
all_results = [bm25_results, knn_results]
fused = reciprocal_rank_fusion(all_results)
# Sort the results by score in descending order
sorted_fused = sorted(fused.items(), key=lambda item: item[1], reverse=True)
print("BM25 Results:", bm25_results)
print("k-NN Results:", knn_results)
print("\nFused and Sorted Results:")
for doc_id, score in sorted_fused:
print(f" {doc_id}: {score:.4f}")
# --- Expected Output ---
# BM25 Results: ['doc_3', 'doc_1', 'doc_5']
# k-NN Results: ['doc_1', 'doc_4', 'doc_3', 'doc_2']
#
# Fused and Sorted Results:
# doc_1: 0.0325 # rank 2 in BM25, rank 1 in k-NN
# doc_3: 0.0323 # rank 1 in BM25, rank 3 in k-NN
# doc_4: 0.0161 # only in k-NN at rank 2
# doc_5: 0.0159 # only in BM25 at rank 3
# doc_2: 0.0156 # only in k-NN at rank 4
Notice how `doc_1` and `doc_3`, which appear in both lists, bubble to the top. `doc_1` wins because its ranks (2 and 1) are better on average than `doc_3`'s (1 and 3).
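Concretely, with `k = 60`:
doc_1: 1/(60+2) + 1/(60+1) ≈ 0.0161 + 0.0164 = 0.0325
doc_3: 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 = 0.0323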
Production Implementation on AWS
1. Setting up OpenSearch Serverless
First, create an OpenSearch Serverless "collection". For this use case, we need an index that supports both text search and vector search.
Index mapping definition (`index_mapping.json`):
{
"settings": {
"index.knn": true,
"index.knn.space_type": "cosinesimil"
},
"mappings": {
"properties": {
"doc_id": {
"type": "keyword"
},
"title": {
"type": "text"
},
"content": {
"type": "text"
},
"embedding": {
"type": "knn_vector",
"dimension": 768, // MUST match your embedding model's dimension
"method": {
"name": "hnsw",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 48
}
}
}
}
}
}
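For reference, here is a minimal sketch of creating the index with this mapping via `opensearch-py`, assuming an `os_client` configured for the collection endpoint (as set up in the Lambda code later) and the index name `hybrid-search-index` used throughout this post:

# create_index.py -- a sketch; assumes os_client is initialized as in search_lambda.py
import json

INDEX_NAME = 'hybrid-search-index'

def create_hybrid_index(os_client, mapping_path: str = 'index_mapping.json') -> None:
    """Creates the hybrid search index from the mapping file if it does not already exist."""
    with open(mapping_path) as f:
        index_body = json.load(f)  # the mapping file must be strict JSON
    if not os_client.indices.exists(index=INDEX_NAME):
        os_client.indices.create(index=INDEX_NAME, body=index_body)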
- `index.knn: true`: Enables the k-NN plugin for the index.
- `embedding.type: knn_vector`: Defines the field for storing vector embeddings.
- `dimension`: This is critical. It must exactly match the output dimension of your embedding model (e.g., 384 for `all-MiniLM-L6-v2`, 768 for `all-mpnet-base-v2`, 1536 for OpenAI `text-embedding-ada-002`).
- `method`: We use `hnsw` with the `faiss` engine, a standard for high-performance approximate nearest neighbor search.

2. The Search Orchestrator Lambda
This is the heart of our system. We'll use Python with `boto3` and the `opensearch-py` library. The key to performance is executing the BM25 and k-NN queries in parallel.
Performance Consideration: Connection Pooling
AWS Lambda freezes the execution environment between invocations. We can exploit this to reuse TCP connections and client objects, significantly reducing latency on warm invocations. We will initialize the OpenSearch client outside the main `lambda_handler` function.
Lambda Code (`search_lambda.py`):
import os
import json
import asyncio
from collections import defaultdict
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
# --- Environment Variables ---
# OPENSEARCH_HOST: The OpenSearch Serverless collection endpoint (e.g., *.aoss.us-east-1.amazonaws.com)
# EMBEDDING_ENDPOINT_NAME: Name of the SageMaker endpoint for embeddings
# --- Initialization (outside handler for reuse) ---
SERVICE = 'aoss'
REGION = os.environ.get('AWS_REGION', 'us-east-1')
OPENSEARCH_HOST = os.environ['OPENSEARCH_HOST']
INDEX_NAME = 'hybrid-search-index'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
credentials.access_key,
credentials.secret_key,
REGION,
SERVICE,
session_token=credentials.token
)
sagemaker_runtime = boto3.client('sagemaker-runtime')
# Initialize OpenSearch client
os_client = OpenSearch(
hosts=[{'host': OPENSEARCH_HOST, 'port': 443}],
http_auth=awsauth,
use_ssl=True,
verify_certs=True,
connection_class=RequestsHttpConnection,
pool_maxsize=20 # Important for parallel queries
)
# --- Helper Functions ---
def get_embedding(text: str) -> list[float]:
"""Invokes a SageMaker endpoint to get the text embedding."""
endpoint_name = os.environ['EMBEDDING_ENDPOINT_NAME']
response = sagemaker_runtime.invoke_endpoint(
EndpointName=endpoint_name,
ContentType='application/json',
Body=json.dumps({'inputs': [text]})
)
result = json.loads(response['Body'].read().decode())
return result[0] # Assuming endpoint returns a list of embeddings
async def run_bm25_query(query_text: str, size: int) -> list[str]:
"""Runs a standard BM25 match query."""
query = {
'size': size,
'_source': False, # We only need the doc IDs for fusion
'query': {
'multi_match': {
'query': query_text,
'fields': ['title', 'content']
}
}
}
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: os_client.search(index=INDEX_NAME, body=query)
)
return [hit['_id'] for hit in response['hits']['hits']]
async def run_knn_query(query_vector: list[float], size: int) -> list[str]:
"""Runs a k-NN vector search query."""
query = {
'size': size,
'_source': False,
'query': {
'knn': {
'embedding': {
'vector': query_vector,
'k': size
}
}
}
}
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: os_client.search(index=INDEX_NAME, body=query)
)
return [hit['_id'] for hit in response['hits']['hits']]
def reciprocal_rank_fusion(search_results: list[list[str]], k: int = 60) -> list[str]:
"""Performs RRF and returns a sorted list of doc IDs."""
fused_scores = defaultdict(float)
for result_list in search_results:
for rank, doc_id in enumerate(result_list, 1):
fused_scores[doc_id] += 1 / (k + rank)
if not fused_scores:
return []
sorted_docs = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
return [doc_id for doc_id, _ in sorted_docs]
# --- Main Handler ---
async def main_logic(event):
"""Main async logic for search orchestration."""
    query_params = event.get('queryStringParameters') or {}  # API Gateway sends null when there is no query string
query_text = query_params.get('q')
if not query_text:
return {'statusCode': 400, 'body': json.dumps({'error': 'Query parameter "q" is required'})}
k_val = int(query_params.get('k_rrf', 60))
size = int(query_params.get('size', 10))
# 1. Get query embedding
try:
query_vector = get_embedding(query_text)
except Exception as e:
print(f"Error getting embedding: {e}")
return {'statusCode': 500, 'body': json.dumps({'error': 'Failed to generate query embedding'})}
# 2. Run queries in parallel
tasks = [
run_bm25_query(query_text, size=size * 2), # Fetch more results to give RRF more to work with
run_knn_query(query_vector, size=size * 2)
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Error handling for parallel tasks
bm25_results, knn_results = [], []
if isinstance(results[0], list):
bm25_results = results[0]
else:
print(f"BM25 query failed: {results[0]}")
if isinstance(results[1], list):
knn_results = results[1]
else:
print(f"k-NN query failed: {results[1]}")
# 3. Perform Reciprocal Rank Fusion
fused_ids = reciprocal_rank_fusion([bm25_results, knn_results], k=k_val)
# 4. Fetch full documents for the top N fused results (optional)
# For simplicity, we return IDs. In production, you'd fetch from DynamoDB or OpenSearch `_source`.
final_results = fused_ids[:size]
return {
'statusCode': 200,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({
'results': final_results,
'fusion_meta': {
'bm25_count': len(bm25_results),
'knn_count': len(knn_results),
'fused_count': len(final_results)
}
})
}
def lambda_handler(event, context):
return asyncio.run(main_logic(event))
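For illustration, the handler expects an API Gateway proxy-style event like the hypothetical one below; a local smoke test would still require the environment variables and reachable OpenSearch and SageMaker endpoints described above:

# Hypothetical local smoke test -- real invocations arrive via API Gateway.
if __name__ == '__main__':
    example_event = {
        'queryStringParameters': {
            'q': 'Error 0x80070005 access denied',  # exact-match heavy query
            'size': '5',
            'k_rrf': '60'
        }
    }
    print(lambda_handler(example_event, None))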
Key Implementation Details:
- `asyncio.gather` is used to run the lexical and semantic queries concurrently. This is crucial for minimizing latency, as the total query time becomes max(latency_bm25, latency_knn) instead of sum(latency_bm25, latency_knn).
- `size * 2` results are fetched from each search system. This provides more candidate documents for the fusion algorithm to re-rank, often leading to better results in the final top `size` list.
- Calling `asyncio.gather` with `return_exceptions=True` ensures that if one search system fails (e.g., the k-NN query times out), the other can still provide results. The RRF function naturally handles an empty list from one of the sources.
- `k`: The RRF `k` parameter is exposed via the query parameter `k_rrf`, allowing dynamic tuning and experimentation without redeploying the Lambda.

3. Deployment and IAM
Package the `opensearch-py` and `requests-aws4auth` dependencies in a Lambda layer or zip file. The Lambda's execution role needs:
- `aoss:APIAccessAll` on the OpenSearch Serverless collection.
- `sagemaker:InvokeEndpoint` on the embedding model's endpoint.
- The standard `AWSLambdaBasicExecutionRole` for CloudWatch Logs.
Advanced Edge Cases and Production Considerations
Tuning the RRF `k` Constant
The `k` parameter in RRF controls the steepness of the rank-to-score decay curve. A smaller `k` gives disproportionately more weight to top-ranked items, making the fusion more sensitive to the #1 and #2 spots. A larger `k` flattens the curve, giving more consideration to items ranked lower in the original lists.
- Small `k` (e.g., 2-10): Use when you have very high confidence in your search systems' top results. This can be useful for navigational queries where the top hit is almost always the correct one.
- Large `k` (e.g., 60-100): A safer, more general-purpose choice. It prevents a single system's potentially noisy top result from dominating the fusion. `k=60` is a well-regarded starting point.
To tune `k` effectively, you need an offline evaluation set with queries and labeled relevant documents. You can then iterate through different values of `k` and measure the impact on metrics like NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision), as sketched below.
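As a rough sketch of such a sweep, assuming an evaluation set of `(bm25_ids, knn_ids, relevant_ids)` tuples per query (names are illustrative) and reusing the list-returning `reciprocal_rank_fusion` from `search_lambda.py`, with a simplified binary-relevance NDCG:

import math

def ndcg_at_n(ranked_ids: list[str], relevant_ids: set[str], n: int = 10) -> float:
    """Binary-relevance NDCG@n: discounted gain of relevant hits vs. the ideal ordering."""
    dcg = sum(1 / math.log2(i + 2) for i, doc_id in enumerate(ranked_ids[:n]) if doc_id in relevant_ids)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant_ids), n)))
    return dcg / ideal if ideal else 0.0

def sweep_rrf_k(eval_set, k_values=(10, 20, 40, 60, 80, 100), n: int = 10) -> None:
    """Prints mean NDCG@n for each candidate RRF k over the labeled evaluation set."""
    for k in k_values:
        scores = [
            ndcg_at_n(reciprocal_rank_fusion([bm25_ids, knn_ids], k=k), set(relevant_ids), n)
            for bm25_ids, knn_ids, relevant_ids in eval_set
        ]
        print(f"k={k}: mean NDCG@{n} = {sum(scores) / len(scores):.4f}")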
Handling Cold Starts
AWS Lambda cold starts can introduce significant latency for the first request in a while. For a user-facing search API, this is often unacceptable. Common mitigations are enabling Provisioned Concurrency on the orchestrator function and keeping heavyweight initialization (clients, connections) outside the handler, as we did above, so warm invocations reuse them.
Multi-tenancy Strategy
For a SaaS application, you must enforce data isolation. A common and effective pattern with this architecture is to add a `tenant_id` field to your OpenSearch documents.
Modified Index Mapping:
{
"mappings": {
"properties": {
"tenant_id": { "type": "keyword" },
...
}
}
}
Then, every query must be wrapped in a `bool` query with a `filter` clause.
Modified k-NN Query for Multi-tenancy:
{
"size": 20,
"_source": false,
"query": {
"bool": {
"must": [
{
"knn": {
"embedding": {
"vector": [0.1, 0.2, ...],
"k": 20
}
}
}
],
"filter": [
{
"term": {
"tenant_id": "tenant-abc-123"
}
}
]
}
}
}
This filtering is highly efficient in OpenSearch. The `tenant_id` can be extracted from a JWT or another authentication mechanism in your API Gateway or Lambda authorizer.
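To wire this into the Lambda, the earlier `run_knn_query` helper could be extended along these lines (a sketch reusing `os_client` and `INDEX_NAME` from `search_lambda.py`; the `tenant_id` is assumed to arrive already validated from the authorizer):

async def run_knn_query_for_tenant(query_vector: list[float], size: int, tenant_id: str) -> list[str]:
    """k-NN query restricted to a single tenant via a bool filter on tenant_id."""
    query = {
        'size': size,
        '_source': False,
        'query': {
            'bool': {
                'must': [
                    {'knn': {'embedding': {'vector': query_vector, 'k': size}}}
                ],
                'filter': [
                    {'term': {'tenant_id': tenant_id}}
                ]
            }
        }
    }
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        lambda: os_client.search(index=INDEX_NAME, body=query)
    )
    return [hit['_id'] for hit in response['hits']['hits']]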
When One Search System Returns No Results
Our RRF implementation handles this gracefully. If, for example, the BM25 query finds no matches for a very abstract semantic query, `bm25_results` will be an empty list. The RRF function will simply iterate over it, add nothing to the scores, and proceed with the k-NN results. The final ranking will be based solely on the k-NN results, which is the desired behavior. This is a significant advantage over score-based fusion methods that might struggle with missing score values.
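For example, using the list-returning `reciprocal_rank_fusion` from `search_lambda.py`:

# BM25 found nothing for a purely semantic query; k-NN still returned hits.
print(reciprocal_rank_fusion([[], ["doc_7", "doc_2"]]))
# -> ['doc_7', 'doc_2'] (the ordering falls back to the k-NN ranking)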
Conclusion
By moving from a single-modality vector search to a serverless hybrid search architecture with Reciprocal Rank Fusion, we have built a system that is far more robust, relevant, and capable of handling the diverse queries seen in production environments. This pattern leverages the strengths of both lexical and semantic search while mitigating their individual weaknesses. The use of AWS Lambda for parallel orchestration ensures low latency, and the rank-based nature of RRF provides a stable, easily maintainable fusion strategy that avoids the pitfalls of complex score normalization.
This architecture is not a theoretical exercise; it is a production-ready blueprint for building next-generation search experiences. It provides the relevance-ranking backbone for sophisticated RAG systems, e-commerce search platforms, and enterprise knowledge bases, delivering a user experience that is simply unattainable with keyword or vector search alone.