Advanced Hybrid Search: Fusing BM25 and SBERT Vectors in Weaviate

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Semantic Search Fallacy in Production

As senior engineers, we've moved past the initial excitement of vector search. The magic of finding documents based on conceptual meaning is powerful, but production workloads quickly reveal its Achilles' heel: keyword blindness. A state-of-the-art sentence-transformer model might brilliantly connect "AI ethics concerns" with "responsible machine learning development," but it will often fail spectacularly when asked to find a document containing the specific error code 0x80070005 or the product SKU ZN-5B2-TX.

This is because dense vector embeddings, by their nature, abstract away lexical details into a high-dimensional semantic space. This abstraction is their strength and their critical weakness.

Conversely, traditional lexical search, epitomized by algorithms like BM25, excels at this precise matching. It is deterministic and reliable for keyword-based queries. However, it completely lacks semantic understanding. A BM25-powered search will not understand that "laptop power issues" is conceptually identical to "computer battery problems" unless those exact words are present.

Production-grade search cannot afford to choose one over the other. It requires a sophisticated fusion of both. This article provides a deep, implementation-focused guide to building such a system using Weaviate, focusing on the fusion of BM25 (sparse vectors) and SBERT-generated dense vectors. We will dissect the architecture, tuning parameters, fusion algorithms, and performance implications necessary to deploy this robustly.


Architecting for Fusion: Schema and Environment

Our foundation is a Weaviate instance configured to handle both dense and sparse vectors within a single collection. We'll use Docker to create a reproducible environment.

1. Docker Compose for a Hybrid-Ready Weaviate Instance

This configuration enables the text2vec-transformers module, which will automatically generate dense vectors for us using a pre-trained SBERT model. It also sets up authentication and persistence.

yaml
# docker-compose.yml
version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.23.7
    ports:
      - 8080:8080
      - 50051:50051
    restart: on-failure:0
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers,generative-openai'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0'

Launch this with docker-compose up -d. The key takeaway here is that Weaviate orchestrates the vectorization process via the t2v-transformers container, which hosts the SBERT model.

2. The Hybrid Collection Schema

A correctly defined schema is critical. It must instruct Weaviate to create two distinct types of indexes for the same property: an HNSW index for the dense vectors and an inverted index for the sparse vectors (BM25).

Here is the JSON schema for our TechnicalDoc collection:

json
{
  "class": "TechnicalDoc",
  "description": "A collection of technical documentation snippets",
  "vectorizer": "text2vec-transformers",
  "moduleConfig": {
    "text2vec-transformers": {
      "model": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
      "vectorizeClassName": false
    }
  },
  "properties": [
    {
      "name": "content",
      "dataType": ["text"],
      "description": "The main content of the document snippet"
    },
    {
      "name": "doc_id",
      "dataType": ["text"],
      "description": "A unique identifier for the document",
      "tokenization": "keyword"
    },
    {
        "name": "category",
        "dataType": ["text"], 
        "description": "Category of the document",
        "tokenization": "word"
    }
  ],
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    }
  }
}

Critical Schema Breakdown:

* "vectorizer": "text2vec-transformers": This top-level key tells Weaviate to use the specified module to generate dense vectors for any data imported into this class.

* "invertedIndexConfig": This is the crucial section for enabling sparse search. By defining the bm25 block, we instruct Weaviate to build a BM25 index on the text properties.

* "bm25": { "b": 0.75, "k1": 1.2 }: These are the standard BM25 tuning parameters. k1 controls term frequency saturation (higher values mean TF matters more), and b controls the influence of document length normalization. For most use cases, the defaults are a reasonable starting point, but tuning them on a representative dataset can yield significant relevance gains.
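To build intuition before tuning, the standard Okapi BM25 formula can be reproduced in a few lines of pure Python. This is an illustrative sketch with a toy corpus and whitespace tokenization, not Weaviate's internal implementation:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        # k1 controls term-frequency saturation; b controls length normalization
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [
    "kubectl describe pod events imagepullbackoff registry secret".split(),
    "postgresql max_connections connection pool superuser".split(),
]
print(bm25_score(["imagepullbackoff"], corpus[0], corpus, k1=1.2, b=0.75))
```

Re-running the last line with different k1 and b values against a representative corpus is a cheap way to see how aggressively term frequency saturates before committing the parameters to invertedIndexConfig.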

We can create this schema using the Python client.


Part 1: Data Ingestion and Indexing Pipeline

Let's build a realistic ingestion pipeline. Our dataset will consist of fictional technical documentation snippets that include both conceptual text and specific, keyword-like identifiers.

The Dataset

python
# sample_data.py
data = [
    {"doc_id": "DOC-001", "category": "Kubernetes", "content": "To resolve the ImagePullBackOff error, first check the pod's events using 'kubectl describe pod <pod-name>'. This often indicates an incorrect image name, tag, or a private registry secret issue."},
    {"doc_id": "DOC-002", "category": "PostgreSQL", "content": "The FATAL: 'remaining connection slots are reserved for non-replication superuser connections' error means you have exhausted the default connection pool. You need to increase 'max_connections' in your postgresql.conf file."},
    {"doc_id": "DOC-003", "category": "Networking", "content": "A TCP connection reset by peer, often seen as error code ERR_CONNECTION_RESET, means the other side of the connection abruptly closed its end of the socket. This can be caused by firewall rules or application crashes."},
    {"doc_id": "DOC-004", "category": "Python", "content": "The GIL (Global Interpreter Lock) in CPython is a mutex that protects access to Python objects, preventing multiple native threads from executing Python bytecodes at the same time. This simplifies memory management but limits parallelism on multi-core CPUs for CPU-bound tasks."},
    {"doc_id": "DOC-005", "category": "Kubernetes", "content": "When deploying stateful applications like databases on Kubernetes, it is best practice to use a StatefulSet controller. This provides stable network identifiers and persistent storage guarantees for your pods."},
    {"doc_id": "DOC-006", "category": "PostgreSQL", "content": "For optimizing slow queries in PostgreSQL, the EXPLAIN ANALYZE command is indispensable. It provides a detailed execution plan, showing how the database is accessing tables and using indexes. Look for sequential scans on large tables."},
    {"doc_id": "DOC-007", "category": "Networking", "content": "Understanding the OSI model is fundamental for network troubleshooting. Layer 4, the Transport Layer, is responsible for end-to-end communication and error recovery, primarily using TCP and UDP protocols."},
    {"doc_id": "DOC-008", "category": "Python", "content": "Asynchronous programming in Python using asyncio allows for concurrent execution of tasks without relying on threads. The 'async' and 'await' keywords are used to define and manage coroutines, which are ideal for I/O-bound operations."}
]

Production-Grade Ingestion Script

This script connects to Weaviate, defines the schema if it doesn't exist, and ingests the data in client-side batches for efficiency and error handling.

python
# ingest.py
import weaviate
import json
from sample_data import data

# Configuration
WEAVIATE_URL = "http://localhost:8080"
CLASS_NAME = "TechnicalDoc"

# Connect to Weaviate
try:
    client = weaviate.Client(WEAVIATE_URL)
    print("Successfully connected to Weaviate.")
except Exception as e:
    print(f"Failed to connect to Weaviate: {e}")
    exit(1)

# 1. Define the Schema
def create_schema():
    if client.schema.exists(CLASS_NAME):
        print(f"Schema '{CLASS_NAME}' already exists. Skipping creation.")
        # Optional: Delete if you want a clean slate
        # client.schema.delete_class(CLASS_NAME)
        # print("Deleted existing schema.")
    else:
        print(f"Creating schema for class: {CLASS_NAME}")
        class_obj = {
            "class": CLASS_NAME,
            "description": "A collection of technical documentation snippets",
            "vectorizer": "text2vec-transformers",
            "moduleConfig": {
                "text2vec-transformers": {
                    "model": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
                    "vectorizeClassName": False
                }
            },
            "properties": [
                {"name": "content", "dataType": ["text"]},
                {"name": "doc_id", "dataType": ["text"], "tokenization": "keyword"},
                {"name": "category", "dataType": ["text"], "tokenization": "word"}
            ],
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                }
            }
        }
        client.schema.create_class(class_obj)
        print("Schema created successfully.")

# 2. Ingest Data with Batching
def ingest_data():
    print("Starting data ingestion...")
    # Configure a batch process
    with client.batch as batch:
        batch.batch_size = 100
        batch.dynamic = True # dynamically create batches
        
        for i, doc in enumerate(data):
            print(f"Adding doc {i+1}: {doc['doc_id']}")
            
            properties = {
                "doc_id": doc["doc_id"],
                "category": doc["category"],
                "content": doc["content"]
            }
            
            # Weaviate will automatically call the text2vec-transformers module
            # to vectorize the 'content' property and store it.
            # It will also build the BM25 index on the same property.
            batch.add_data_object(
                data_object=properties,
                class_name=CLASS_NAME
            )
            
    # The batch flushes when the 'with' block exits; verify with an aggregate count.
    count = client.query.aggregate(CLASS_NAME).with_meta_count().do()["data"]["Aggregate"][CLASS_NAME][0]["meta"]["count"]
    print(f"Ingestion complete. {count} objects in collection.")

if __name__ == "__main__":
    create_schema()
    ingest_data()

When this script runs, Weaviate performs two critical actions for each object:

  • Dense Vectorization: It sends the content property to the t2v-transformers module. The SBERT model converts the text into a 384-dimensional dense vector, which is then indexed in an HNSW graph.
  • Sparse Indexing: It tokenizes the content property and updates its internal inverted index to power BM25 search.

This dual indexing is the core enabler of hybrid search.


    Part 2: The Art of the Hybrid Query

    With our data indexed, we can now explore the power and nuance of hybrid queries. We will demonstrate the failure modes of pure search types and show how the hybrid approach provides a superior solution.

    Query Scenario 1: The Keyword Failure of Pure Vector Search

    Let's search for a very specific technical term that relies on lexical matching.

    Query: ERR_CONNECTION_RESET

    python
    # query.py
    import weaviate
    import json
    
    client = weaviate.Client("http://localhost:8080")
    CLASS_NAME = "TechnicalDoc"
    
    def pretty_print(result):
        print(json.dumps(result, indent=2))
    
    # --- Pure Vector (Dense) Search ---
    print("\n--- 1. PURE VECTOR SEARCH (SEMANTIC) ---")
    query_text = "ERR_CONNECTION_RESET"
    
    response = (
        client.query
        .get(CLASS_NAME, ["content", "doc_id"])
        .with_near_text({"concepts": [query_text]})
        .with_limit(3)
        .do()
    )
    
    pretty_print(response)

    Expected (and often actual) poor result:

    json
    {
      "data": {
        "Get": {
          "TechnicalDoc": [
            {
              "content": "A TCP connection reset by peer... seen as error code ERR_CONNECTION_RESET...",
              "doc_id": "DOC-003"
            },
            {
              "content": "The FATAL: 'remaining connection slots are reserved...' error means you have exhausted the default connection pool...",
              "doc_id": "DOC-002"
            },
            {
              "content": "Understanding the OSI model is fundamental for network troubleshooting...",
              "doc_id": "DOC-007"
            }
          ]
        }
      }
    }

    While DOC-003 is correctly identified, the other results are semantically related to "connections" and "errors" but miss the specific keyword. The embedding model sees "error" and "connection" and pulls in related documents, diluting the importance of the exact string ERR_CONNECTION_RESET. This is unacceptable in a technical search context.

    Query Scenario 2: The Semantic Failure of Pure Keyword Search

    Now, let's try a conceptual query where keywords don't align perfectly.

    Query: database scaling problems

    python
    # query.py (continued)
    
    # --- Pure Keyword (Sparse) Search ---
    print("\n--- 2. PURE KEYWORD SEARCH (BM25) ---")
    query_text = "database scaling problems"
    
    response = (
        client.query
        .get(CLASS_NAME, ["content", "doc_id"])
        .with_bm25(query=query_text)
        .with_limit(3)
        .do()
    )
    
    pretty_print(response)

    Expected Result:

    json
    {
      "data": {
        "Get": {
          "TechnicalDoc": [
            {
              "content": "When deploying stateful applications like databases on Kubernetes...",
              "doc_id": "DOC-005"
            }
          ]
        }
      }
    }

    BM25 only finds DOC-005 because it contains the word "databases". It completely misses DOC-002, which discusses a classic database scaling issue (max_connections), because the words "scaling" or "problems" are not present. The search is literal and lacks intelligence.

    Query Scenario 3: The Hybrid Solution

    This is where we fuse the two approaches. The with_hybrid operator in Weaviate allows us to combine a text query (which triggers both dense and sparse searches) and control their relative importance with the alpha parameter.

    * alpha = 1: Pure vector (dense) search.

    * alpha = 0: Pure keyword (sparse) search.

    * alpha = 0.5: Equal blend of both.

    Let's re-run our conceptual query with a balanced hybrid search.

    Query: database scaling problems with alpha = 0.5

    python
    # query.py (continued)
    
    # --- Hybrid Search ---
    print("\n--- 3. HYBRID SEARCH (alpha=0.5) ---")
    query_text = "database scaling problems"
    
    response = (
        client.query
        .get(CLASS_NAME, ["content", "doc_id"])
        .with_hybrid(
            query=query_text,
            alpha=0.5, # 0 (keyword) to 1 (vector)
            fusion_type=weaviate.gql.get.HybridFusion.RANKED  # More on this later
        )
        .with_additional(["score", "explainScore"])
        .with_limit(3)
        .do()
    )
    
    pretty_print(response)

    Expected Superior Result:

    json
    {
      "data": {
        "Get": {
          "TechnicalDoc": [
            {
              "_additional": {
                "explainScore": "hybrid score: 0.016393442, fusion type: rankedFusion",
                "score": "0.016393442"
              },
              "content": "The FATAL: 'remaining connection slots are reserved...' error means you have exhausted the default connection pool...",
              "doc_id": "DOC-002"
            },
            {
              "_additional": {
                "explainScore": "hybrid score: 0.012121212, fusion type: rankedFusion",
                "score": "0.012121212"
              },
              "content": "When deploying stateful applications like databases on Kubernetes...",
              "doc_id": "DOC-005"
            }
          ]
        }
      }
    }

    Success! The hybrid query correctly identifies DOC-002 as the top result. The semantic search component understood that "database scaling problems" is conceptually related to "exhausted the default connection pool," while the keyword component gave a boost to DOC-005 for containing the word "databases." The fusion algorithm combined these signals to produce a highly relevant ranking.

    Now, let's fix our keyword search problem.

    Query: ERR_CONNECTION_RESET with alpha = 0.25 (leaning towards keyword)

    python
    # query.py (continued)
    
    print("\n--- 4. HYBRID SEARCH (alpha=0.25 for keyword emphasis) ---")
    query_text = "ERR_CONNECTION_RESET"
    
    response = (
        client.query
        .get(CLASS_NAME, ["content", "doc_id"])
        .with_hybrid(
            query=query_text,
            alpha=0.25
        )
        .with_additional(["score"])
        .with_limit(3)
        .do()
    )
    
    pretty_print(response)

    By setting alpha to 0.25, we tell Weaviate that the BM25 score should be weighted more heavily than the vector similarity score. This ensures that the document containing the exact keyword phrase rises to the top, while still allowing semantically similar results to appear lower in the ranking.
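Rather than hard-coding alpha per call, one practical refinement is to derive it from the query itself. The heuristic below is an illustrative sketch, not a Weaviate feature: it drops alpha toward keyword search when the query looks like an identifier, such as the error codes and SKUs discussed earlier:

```python
import re

# Patterns that suggest exact-match intent: hex codes, SCREAMING_SNAKE
# identifiers, and SKU-like tokens containing digits and dashes.
IDENTIFIER_PATTERNS = [
    re.compile(r"0x[0-9A-Fa-f]+"),          # e.g. 0x80070005
    re.compile(r"\b[A-Z][A-Z0-9_]{3,}\b"),  # e.g. ERR_CONNECTION_RESET
    re.compile(r"\b\w+-\w*\d\w*-\w+\b"),    # e.g. ZN-5B2-TX
]

def choose_alpha(query: str, default: float = 0.5) -> float:
    """Lean toward BM25 (low alpha) for identifier-like queries."""
    if any(p.search(query) for p in IDENTIFIER_PATTERNS):
        return 0.25
    return default

print(choose_alpha("ERR_CONNECTION_RESET"))      # identifier -> keyword-heavy
print(choose_alpha("database scaling problems")) # conceptual -> balanced
```

The returned value can be passed straight into with_hybrid(query=..., alpha=choose_alpha(query)). In production you would tune the patterns and thresholds against logged queries and relevance judgments rather than guessing.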


    Advanced Topic: Dissecting Fusion Algorithms

    Weaviate offers two primary fusion algorithms: rankedFusion and relativeScoreFusion. Understanding their mechanics is crucial for fine-tuning production systems.

    1. Ranked Fusion (Reciprocal Rank Fusion - RRF)

    This is the default and often the most effective method. RRF does not look at the raw scores from the dense and sparse searches. Instead, it only considers the rank of each document in the respective result lists.

    The formula, with the alpha weighting applied to each search's contribution, is: RRF_Score(doc) = alpha / (k + rank_dense) + (1 - alpha) / (k + rank_sparse)

    * rank_dense: The position of the document in the dense search results.

    * rank_sparse: The position of the document in the sparse search results.

    * k: A constant (usually 60 in Weaviate) that dampens the influence of high ranks.

    Why is this powerful? It completely sidesteps the problem of score normalization. BM25 scores can range from 0 to a high positive number, while cosine similarity scores are typically between -1 and 1. Comparing them directly is meaningless. RRF normalizes by using rank, making it robust and easy to reason about. The alpha parameter weights the dense and sparse contributions in the sum, but each contribution is derived purely from rank, never from raw scores.
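The arithmetic is easy to verify in pure Python. The sketch below fuses two ranked lists of doc ids with k=60 and assumes, per Weaviate's documented behavior, that alpha weights each search's rank-derived contribution (the ids come from our sample dataset):

```python
def rrf_fuse(dense_ranked, sparse_ranked, alpha=0.5, k=60):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc ids."""
    scores = {}
    for rank, doc in enumerate(dense_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank)
    for rank, doc in enumerate(sparse_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["DOC-002", "DOC-005", "DOC-007"]    # semantic ranking
sparse = ["DOC-005", "DOC-003", "DOC-002"]   # BM25 ranking
for doc, score in rrf_fuse(dense, sparse, alpha=0.5):
    print(f"{doc}: {score:.6f}")
```

In this toy example DOC-005 edges out DOC-002 because a first-place rank contributes 1/(60+1) regardless of how large the underlying BM25 or cosine score was; only positions matter.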

    2. Relative Score Fusion

    This method attempts to normalize the scores from each search type into a 0-1 range and then combines them with a weighted sum based on the alpha parameter.

    How it works (simplified):

    • Run both dense and sparse searches.
    • For each result set, find the minimum and maximum scores.
    • Normalize every document's score into the 0-1 range: norm = (score - min) / (max - min).
    • Calculate the final score: Final_Score = (alpha * norm_dense_score) + ((1 - alpha) * norm_sparse_score)

    When to use it? relativeScoreFusion can be useful when you have a strong reason to believe that the magnitude of the scores is meaningful. For example, if a BM25 score is extremely high (indicating a perfect keyword match on a rare term), you might want that magnitude to have a greater impact than its mere rank. However, it is often more brittle than RRF because it is sensitive to score distribution outliers.

    Edge Case: If the top result from one search has a score 100x higher than the second result, it will squash the normalized scores of all other documents from that search, potentially skewing the final hybrid results. RRF is immune to this problem.
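For contrast, here is a pure-Python sketch of relative-score fusion. It assumes min-max scaling (recent Weaviate versions scale scores this way; dividing by the maximum alone is a close variant), and the raw scores below are hypothetical:

```python
def relative_score_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Fuse raw score dicts {doc_id: score} via min-max normalization."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo or 1.0  # avoid division by zero when all scores are equal
        return {doc: (s - lo) / span for doc, s in scores.items()}

    nd, ns = normalize(dense_scores), normalize(sparse_scores)
    fused = {}
    for doc in set(nd) | set(ns):
        fused[doc] = alpha * nd.get(doc, 0.0) + (1 - alpha) * ns.get(doc, 0.0)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

dense = {"DOC-002": 0.82, "DOC-005": 0.79, "DOC-007": 0.41}   # cosine-like scores
sparse = {"DOC-005": 7.3, "DOC-002": 2.1}                     # BM25-like scores
print(relative_score_fuse(dense, sparse, alpha=0.5))
```

Notice that the lowest-scoring document in each list normalizes to exactly zero; this is one way an outlier at either end of the distribution can distort the blend.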


    Performance and Production Considerations

    1. Indexing Overhead

    Enabling hybrid search has a clear cost: you are building and maintaining two separate indexes.

    * Disk Space: The HNSW graph for dense vectors and the inverted index for sparse vectors will both consume disk space. The vector cache is often the largest component.

    * Ingestion Time: Ingestion will be slower as Weaviate must perform both vectorization (a CPU/GPU-intensive task) and tokenization/indexing (a CPU-intensive task).

    For large-scale deployments, this requires careful capacity planning. However, the query-time benefits almost always outweigh the indexing cost.

    2. Query Latency

    A hybrid query is effectively two queries followed by a lightweight fusion step. Therefore, the latency of a hybrid query will be approximately max(latency_dense, latency_sparse) + latency_fusion.

    * Dense Search Latency: Largely dependent on the size of the HNSW graph and the ef (search-time exploration factor) parameter.

    * Sparse Search Latency: Dependent on the complexity of the query and the size of the inverted index.

    * Fusion Latency: The RRF calculation is extremely fast and adds negligible overhead.

    In a well-tuned system, hybrid query latency should be very close to the latency of the slower of the two individual searches.

    3. Benchmarking Your Implementation

    Never assume default parameters are optimal. A simple benchmarking script can provide invaluable insights.

    python
    # benchmark.py
    import time
    import weaviate
    
    client = weaviate.Client("http://localhost:8080")
    CLASS_NAME = "TechnicalDoc"
    
    QUERIES = [
        "kubernetes pod error",
        "how to fix database connection limit",
        "GIL performance impact",
        "ERR_CONNECTION_RESET",
        "asynchronous programming benefits"
    ]
    
    def run_benchmark(query_fn, description):
        start_time = time.time()
        for _ in range(10): # Run multiple iterations
            for query in QUERIES:
                query_fn(query)
        end_time = time.time()
        duration = end_time - start_time
        qps = (len(QUERIES) * 10) / duration
        print(f"--- {description} ---")
        print(f"Total time for 50 queries: {duration:.2f}s")
        print(f"Queries Per Second (QPS): {qps:.2f}\n")
    
    def bm25_query(q):
        client.query.get(CLASS_NAME, ["doc_id"]).with_bm25(query=q).do()
    
    def vector_query(q):
        client.query.get(CLASS_NAME, ["doc_id"]).with_near_text({"concepts": [q]}).do()
    
    def hybrid_query(q):
        client.query.get(CLASS_NAME, ["doc_id"]).with_hybrid(query=q, alpha=0.5).do()
    
    if __name__ == "__main__":
        run_benchmark(bm25_query, "BM25 Search")
        run_benchmark(vector_query, "Vector Search")
        run_benchmark(hybrid_query, "Hybrid Search (alpha=0.5)")

    Running this script will give you a concrete performance baseline for your specific hardware and dataset, allowing you to make informed decisions about resource allocation and optimization.
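One refinement worth making: QPS averages hide tail behavior. If the loop above is changed to record each query's elapsed time individually, the standard library can summarize percentiles. The helper below is a sketch (the sample latencies are synthetic placeholders):

```python
import statistics

def latency_report(latencies_ms):
    """Summarize per-query latencies (milliseconds) into common percentiles."""
    qs = statistics.quantiles(latencies_ms, n=100)  # qs[i] is the (i+1)th percentile
    return {
        "p50": statistics.median(latencies_ms),
        "p95": qs[94],
        "p99": qs[98],
        "max": max(latencies_ms),
    }

# In practice: wrap each query_fn(query) call with time.perf_counter()
# and append the elapsed milliseconds to this list.
sample = [12.0, 14.1, 13.5, 15.2, 11.8, 40.3, 12.9, 13.1, 14.8, 12.2]
print(latency_report(sample))
```

Watching p95 and p99 rather than mean QPS is what surfaces problems like HNSW cache misses or BM25 queries over very common terms, which only affect a minority of requests.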

    Final Thoughts: A New Baseline for Search

    Hybrid search is not a niche feature; it is rapidly becoming the baseline expectation for any serious search implementation. By fusing the lexical precision of sparse vectors with the conceptual understanding of dense vectors, we create a system that is robust, intelligent, and far more aligned with user intent.

    For senior engineers, the task is not merely to enable a feature flag. It is to understand the underlying mechanics of indexing, the subtle but critical differences between fusion algorithms like RRF and relative scoring, and the art of tuning the alpha parameter based on query analysis and user feedback. The production-ready patterns discussed here—from the Dockerized environment and schema design to batched ingestion and performance benchmarking—provide the blueprint for building search systems that are truly greater than the sum of their parts.
