Advanced Hybrid Search: Fusing BM25 and Dense Vectors in Vespa

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Dichotomy of Search: Semantic vs. Lexical Relevance

In modern information retrieval, the core challenge is no longer about finding documents with matching terms, but about understanding user intent. Senior engineers recognize that production search systems operate on a spectrum between lexical and semantic relevance. On one end, we have sparse retrieval methods like BM25, which are computationally efficient, highly precise for exact keyword matches, and form the backbone of traditional search. On the other, we have dense retrieval using vector embeddings, which captures semantic relationships and user intent, enabling searches like "summer clothes for warm weather" to match a product titled "lightweight linen shirt."

However, deploying either in isolation creates significant blind spots in a production environment, particularly in e-commerce:

* BM25 Failure Mode: A query for a specific model number like "RTX 4090" might fail to retrieve a product titled "NVIDIA GeForce RTX-4090 GPU" if the tokenizer splits the terms differently. It has no concept that GPU and graphics card are semantically identical.

* Dense Vector Failure Mode: A query for "iPhone 14 Pro case" might over-generalize and return cases for the "iPhone 14" or even "iPhone 13 Pro" if their vector representations are close in the embedding space. It can miss the critical, non-negotiable keyword "Pro".

True state-of-the-art search requires a fusion of both worlds. The goal is not just to run two separate queries and merge the results, but to create a unified ranking system that leverages the strengths of both signals simultaneously. This is hybrid search. We will use Vespa, an open-source serving engine that is uniquely suited for this task due to its native support for both sparse and dense tensors and its highly customizable ranking framework.

This article assumes you are familiar with the concepts of BM25, vector embeddings (e.g., from Sentence-BERT), and the basics of search engine architecture. We will dive directly into the implementation of a sophisticated hybrid search system for an e-commerce product catalog.


Section 1: Architecting the Hybrid Index in Vespa

The foundation of our system is the Vespa schema, defined in a .sd (schema definition) file. This file dictates how data is stored, indexed, and made available for querying and ranking. Our schema must efficiently support both BM25 calculations on text fields and Approximate Nearest Neighbor (ANN) search on vector embeddings.

Let's define product.sd for our e-commerce catalog.

sd
# product.sd
schema product {

    document product {
        field product_id type string {
            indexing: summary | attribute
            attribute: fast-search
        }

        field title type string {
            indexing: index | summary
            index: enable-bm25
        }

        field description type string {
            indexing: index | summary
            index: enable-bm25
        }

        field brand type string {
            indexing: summary | attribute
            attribute: fast-search
        }

        # Dense vector for semantic search
        field embedding type tensor<float>(d0[384]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 200
                }
            }
        }
    }

    # Fieldset for combined BM25 search across multiple fields
    fieldset default {
        fields: title, description
    }

    # --- Rank Profiles for Hybrid Search --- #

    rank-profile default {
        first-phase {
            expression: bm25(default)
        }
    }

    # Rank profile for pure vector search
    rank-profile semantic_search {
        first-phase {
            expression: closeness(field, embedding)
        }
        inputs {
            query(q_embedding) tensor<float>(d0[384])
        }
    }

    # Initial naive hybrid rank profile
    rank-profile hybrid_retrieval {
        first-phase {
            expression: bm25(default) + closeness(field, embedding)
        }
        inputs {
            query(q_embedding) tensor<float>(d0[384])
        }
    }

    # Advanced, normalized hybrid rank profile
    # normalize_linear is a cross-hit normalizer and can only run in global-phase,
    # so the fusion of normalized scores happens there, over the top re-ranked hits.
    rank-profile normalized_hybrid_search {
        inputs {
            query(q_embedding) tensor<float>(d0[384])
        }
        first-phase {
            expression: bm25(default) + closeness(field, embedding)
        }
        global-phase {
            expression: normalize_linear(bm25(default)) + (1.5 * closeness(field, embedding))
            rerank-count: 1000
        }
    }
}

Deconstructing the Schema

  • Document Fields (product_id, title, etc.): Standard fields for our products. For title and description, we specify indexing: index to create an inverted index and index: enable-bm25 to make BM25 scoring features available.
  • The Dense Vector Field (embedding): This is the core of the semantic component.
    * tensor<float>(d0[384]): This defines a one-dimensional tensor (a vector) of 384 floating-point numbers. The dimension 384 must match the output of our embedding model (e.g., all-MiniLM-L6-v2).

    * indexing: attribute | index: attribute stores the tensor in memory for fast access during ranking and is required for distance calculations. index builds the ANN index.

    * distance-metric: angular: We use angular distance (which is mathematically related to cosine similarity) as it's standard for Sentence-BERT style embeddings. Other options include euclidean or dotproduct.

    * index { hnsw { ... } }: This is critical. We are telling Vespa to build an HNSW (Hierarchical Navigable Small World) graph for this tensor field. This is the data structure that enables fast ANN search. The max-links-per-node and neighbors-to-explore-at-insert parameters are performance tuning knobs we will discuss later.

  • Rank Profiles: This is where Vespa's power lies. A rank profile defines how documents are scored. We have defined four:
    * default: A standard BM25 search.

    * semantic_search: A pure vector search using the closeness(field, embedding) rank feature, which is derived from the configured distance metric (closeness is 1 / (1 + distance), so it always falls in the (0, 1] range). It requires a query tensor q_embedding as input.

    * hybrid_retrieval: Our first, naive attempt at hybrid search. It simply adds the raw BM25 score to the raw closeness score. This is a common pitfall. These scores are on vastly different scales; BM25 scores can be > 10, while closeness is always between 0 and 1. This naive addition will be overwhelmingly dominated by the BM25 score.

    * normalized_hybrid_search: A more sophisticated approach. We use Vespa's built-in normalize_linear function to rescale the BM25 score to a [0, 1] range across the hits for a query; because it is a cross-hit normalizer it can only run in the global-phase, so the profile keeps a cheap first-phase and performs the normalized fusion in global-phase over the top-ranked candidates. We also apply a weight (1.5) to the closeness score to give it more influence. This weight is a hyperparameter that must be tuned based on real-world relevance testing; a small numeric sketch of why normalization matters follows this list.
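
To make the scale problem concrete, here is a small standalone sketch in plain Python (the scores are invented for illustration and are not real Vespa output); it contrasts the naive sum with a min-max rescaling of BM25 plus a 1.5x closeness weight, analogous to the fused expression above:

python
# Invented scores for illustration only -- not real Vespa output.
bm25_scores = {"prod-001": 12.4, "prod-002": 0.0, "prod-003": 3.1}
closeness_scores = {"prod-001": 0.82, "prod-002": 0.31, "prod-003": 0.77}

def min_max(scores):
    """Linearly rescale a score map to [0, 1], as normalize_linear does across a query's hits."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in scores.items()}

# Naive fusion: BM25 (often > 10) swamps closeness (always <= 1).
naive = {d: bm25_scores[d] + closeness_scores[d] for d in bm25_scores}

# Normalized fusion: both signals contribute on comparable scales.
normalized_bm25 = min_max(bm25_scores)
fused = {d: normalized_bm25[d] + 1.5 * closeness_scores[d] for d in bm25_scores}

print("naive :", sorted(naive.items(), key=lambda kv: -kv[1]))
print("fused :", sorted(fused.items(), key=lambda kv: -kv[1]))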


    Section 2: The Data Ingestion and Embedding Pipeline

    With the schema defined, we need a pipeline to generate embeddings and feed data into Vespa. This is typically a background process that listens for product catalog updates.

    For this example, we'll use the sentence-transformers library in Python to generate embeddings and format the data for Vespa's JSON feed format.

    Python Script for Embedding Generation and Feeding

    python
    import json
    import requests
    from sentence_transformers import SentenceTransformer
    
    # 1. Load a pre-trained sentence transformer model
    # This model outputs 384-dimensional embeddings.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # 2. Example product data from our catalog
    products = [
        {
            "id": "1",
            "fields": {
                "product_id": "prod-001",
                "title": "Lightweight Linen Summer Shirt",
                "description": "A breathable shirt made from 100% organic linen, perfect for hot and humid climates. Available in blue and white.",
                "brand": "SummerWear"
            }
        },
        {
            "id": "2",
            "fields": {
                "product_id": "prod-002",
                "title": "Heavy-Duty Winter Parka",
                "description": "Stay warm in sub-zero temperatures with this goose-down insulated parka. Features a waterproof shell and fur-lined hood.",
                "brand": "ArcticGear"
            }
        },
        {
            "id": "3",
            "fields": {
                "product_id": "prod-003",
                "title": "V-Neck Cotton T-Shirt",
                "description": "A classic everyday essential. Made from soft, pre-shrunk cotton for a perfect fit.",
                "brand": "BasicsCo"
            }
        }
    ]
    
    # 3. Generate embeddings
    # In a production system, this would be done in a separate service.
    # We combine title and description for a richer semantic representation.
    texts_to_embed = [f"{p['fields']['title']}. {p['fields']['description']}" for p in products]
    embeddings = model.encode(texts_to_embed)
    
    # 4. Prepare the Vespa feed JSON
    vespa_feed = []
    for i, product in enumerate(products):
        product_id = product['fields']['product_id']
        doc = {
            "put": f"id:product:product::{product_id}",
            "fields": {
                **product['fields'],
                "embedding": {
                    "values": embeddings[i].tolist()
                }
            }
        }
        vespa_feed.append(doc)
    
    # 5. Write to a JSON file for feeding
    feed_file_path = "/tmp/vespa_feed.json"
    with open(feed_file_path, 'w') as f:
        for item in vespa_feed:
            f.write(json.dumps(item) + '\n')
    
    print(f"Feed file generated at {feed_file_path}")
    
    # In a real application, you would use vespa-feed-client for high-performance feeding:
    # it consumes the JSONL file written above (one {"put": ..., "fields": ...} operation
    # per line) and handles batching, compression, and resilient retries for you.
    # The /document/v1/ HTTP API, by contrast, addresses one document per request,
    # so the whole feed file cannot simply be POSTed to it in a single call.
    
    # You can also use Python requests for smaller batches:
    vespa_endpoint = "http://localhost:8080/document/v1/product/product/docid/"
    headers = {"Content-Type": "application/json"}
    
    for doc in vespa_feed:
        doc_id = doc['fields']['product_id']
        # The 'put' key is for the CLI format, we remove it for the HTTP API
        payload = {"fields": doc['fields']}
        response = requests.post(f"{vespa_endpoint}{doc_id}", data=json.dumps(payload), headers=headers)
        if response.status_code != 200:
            print(f"Failed to feed document {doc_id}: {response.text}")
        else:
            print(f"Successfully fed document {doc_id}")
    

    Production Ingestion Pipeline Considerations

    * Decoupling: The embedding generation service should be decoupled from the main application logic. When a product is updated, a message (e.g., containing the product ID) should be sent to a queue (like RabbitMQ or Kafka). The embedding service consumes from this queue, generates the new embedding, and pushes the update to Vespa.

    * Batching: For large-scale indexing, batching is crucial. The vespa-feed-client utility is highly optimized for this, handling batching, compression, and resilient feeding with retries.

    * Partial Updates: Vespa supports partial updates (an update operation instead of a put), which is highly efficient. If only the product's price changes, you don't need to re-calculate the embedding or re-index all the text fields; you can send an update for just the changed field, as sketched below.
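
    As a minimal sketch of such a partial update (reusing the brand field from our schema, since the example catalog has no price field; the endpoint follows the /document/v1 convention used in the feeding script), an assign operation sent with HTTP PUT touches only that field and leaves the embedding and text indexes untouched:

    python
    import requests

    # Partial update: reassign a single attribute without re-feeding the whole document.
    # /document/v1 partial updates are sent with HTTP PUT and per-field update operations.
    doc_id = "prod-001"  # the same user-specified id used in the feed example
    url = f"http://localhost:8080/document/v1/product/product/docid/{doc_id}"
    payload = {"fields": {"brand": {"assign": "SummerWear Premium"}}}

    response = requests.put(url, json=payload)
    response.raise_for_status()
    print(f"Updated brand for {doc_id}")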


    Section 3: Crafting Hybrid Queries with YQL and Rank Profiles

    With data indexed, we can now execute queries. We'll use Vespa's query language, YQL (Yahoo Query Language), and target the different rank profiles we defined.

    First, let's generate a query embedding for "clothes for warm weather".

    python
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_text = "clothes for warm weather"
    query_embedding = model.encode(query_text).tolist()
    
    # print(query_embedding) # This is the vector we will send in our query

    The Query Structure

    The core of a hybrid query in Vespa involves two parts:

  • A where clause to select a candidate set of documents. This is where we use both keyword and vector search operators.
  • A ranking parameter to specify which rank profile to use for scoring the candidate set.
    Here's the YQL for our hybrid search. We use the OR operator to create a candidate set from documents that match either the keyword search or the vector search.

    select * from product where (userQuery() OR ({targetHits:100}nearestNeighbor(embedding, q_embedding)))

    * userQuery(): This is a YQL placeholder that Vespa populates with the user's text query, which will be matched against the default fieldset.

    * nearestNeighbor(embedding, q_embedding): This is the ANN search operator. It finds documents whose embedding vector is closest to the query vector q_embedding.

    * {targetHits:100}: This is a crucial performance annotation. It tells the ANN search how many approximate nearest neighbors to expose to the ranking phase, which bounds how much of the HNSW graph is explored. The candidate set for ranking will be the union of the keyword matches and these ~100 nearest neighbors; a runnable sketch of the full hybrid request follows.
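
    To make the request concrete, here is a minimal Python sketch of the full hybrid query (it assumes the local endpoint and the feed from Section 2; the parameters are sent as a JSON POST body, which the search API accepts just like URL parameters and which avoids URL-length issues with the 384-float query tensor):

    python
    import requests
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_text = "linen shirt for hot days"

    request_body = {
        "yql": "select * from product where (userQuery() or ({targetHits:100}nearestNeighbor(embedding, q_embedding)))",
        "query": query_text,                    # consumed by userQuery()
        "ranking": "normalized_hybrid_search",  # the rank profile used to score candidates
        "input.query(q_embedding)": model.encode(query_text).tolist(),
        "hits": 10,
    }

    response = requests.post("http://localhost:8080/search/", json=request_body)
    response.raise_for_status()
    for hit in response.json().get("root", {}).get("children", []):
        print(round(hit["relevance"], 4), hit["fields"].get("title"))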

    Executing Queries

    Let's construct the full HTTP GET requests to see the difference between our rank profiles.

    Query: "summer shirt"

    Semantic Query: "clothes for warm weather"

  • Pure BM25 Search (default rank profile)
    bash
        curl -s --get 'http://localhost:8080/search/' \
          --data-urlencode 'yql=select * from product where userQuery()' \
          --data-urlencode 'query=summer shirt'

    Expected Result: Will match prod-001 with a high BM25 score. Will not match for the query "clothes for warm weather".

  • Pure Semantic Search (semantic_search rank profile)
    bash
        # The query vector needs to be passed as a parameter
        # Assuming QUERY_VECTOR is the JSON array of our embedding
        curl -s --get 'http://localhost:8080/search/' \
          --data-urlencode 'yql=select * from product where {targetHits:100}nearestNeighbor(embedding, q_embedding)' \
          --data-urlencode 'ranking=semantic_search' \
          --data-urlencode 'input.query(q_embedding)=[...]' # Paste the query vector here

    Expected Result: For the query "clothes for warm weather", this will rank prod-001 (Lightweight Linen Summer Shirt) highest due to high semantic similarity. prod-002 (Winter Parka) will have a very low score.

  • Normalized Hybrid Search (normalized_hybrid_search rank profile)
    This is where we combine both signals. We'll use the query "linen shirt for hot days".

    * Lexical signal: "linen" and "shirt" will be picked up by BM25.

    * Semantic signal: "for hot days" will be captured by the embedding.

    bash
        # Assuming QUERY_VECTOR is for "linen shirt for hot days"
        curl -s --get 'http://localhost:8080/search/' \
          --data-urlencode 'yql=select * from product where (userQuery() OR ({targetHits:100}nearestNeighbor(embedding, q_embedding)))' \
          --data-urlencode 'query=linen shirt for hot days' \
          --data-urlencode 'ranking=normalized_hybrid_search' \
          --data-urlencode 'input.query(q_embedding)=[...]' # Paste the query vector here

    How it Works:

    * The yql clause gathers candidates. prod-001 will be retrieved by both userQuery() (matching "linen" and "shirt") and nearestNeighbor().

    * The normalized_hybrid_search rank profile is then executed on all candidates.

    * For prod-001, both bm25(default) and closeness(embedding) will have high scores. Their normalized and weighted sum will result in a very high final relevance score.

    * Another product, like a "silk blouse", might be retrieved by the nearestNeighbor operator due to some semantic similarity, but its bm25(default) score will be zero, resulting in a lower final rank.


    Section 4: Performance Tuning and Production Considerations

    Deploying a hybrid system at scale requires careful tuning of the ANN index, memory management, and query execution.

    HNSW Index Tuning

    The HNSW index has two primary tuning parameters in the schema:

    * max-links-per-node: The maximum number of connections (edges) each node in the HNSW graph can have. Higher values create a denser, more accurate graph but increase index size and build time. A typical range is 16-48.

    * neighbors-to-explore-at-insert: The size of the candidate list used when searching for neighbors during graph construction. Higher values lead to a better quality index (higher recall) at the cost of significantly longer indexing times.

    Benchmarking Strategy:

    To tune these, you need an offline evaluation set with known ground truth (e.g., a list of queries and their expected relevant documents). You can then script Vespa deployments with different HNSW parameters, feed the same corpus, and measure:

  • Recall@K: What percentage of the true nearest neighbors are found in the top K results?
  • Query Latency: The p95 or p99 query time.
  • Index Build Time: How long it takes to feed the entire corpus.
  • Index Size: The on-disk and in-memory footprint of the index.
    You will typically find a trade-off curve between recall and latency. For e-commerce search, a slight dip in recall (e.g., from 99% to 97%) is often acceptable for a significant latency reduction (e.g., 80 ms to 40 ms). A minimal Recall@K computation over such an evaluation set is sketched below.
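
    As a minimal sketch of the evaluation loop itself (the query set, ground-truth relevance judgments, and a search function wrapping the query API are assumptions, not part of the article's code), Recall@K can be computed like this:

    python
    from typing import Callable, Dict, List

    def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
        """Fraction of ground-truth relevant documents found in the top-k retrieved ids."""
        if not relevant:
            return 0.0
        top_k = set(retrieved[:k])
        return sum(1 for doc_id in relevant if doc_id in top_k) / len(relevant)

    def mean_recall_at_k(ground_truth: Dict[str, List[str]],
                         search_fn: Callable[[str], List[str]],
                         k: int = 10) -> float:
        """Average Recall@K over a {query: [relevant product_ids]} mapping.

        search_fn is assumed to return an ordered list of product_ids for a query,
        e.g. a thin wrapper around the /search/ request shown in Section 3.
        """
        scores = [recall_at_k(search_fn(q), relevant, k)
                  for q, relevant in ground_truth.items()]
        return sum(scores) / len(scores)

    # Hypothetical ground truth for illustration:
    # ground_truth = {"clothes for warm weather": ["prod-001", "prod-003"]}
    # print(mean_recall_at_k(ground_truth, search_fn=my_vespa_search, k=10))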

    Query-Time Performance: `targetHits`

    The targetHits parameter in the nearestNeighbor operator is your most important lever for controlling query latency. It directly limits the number of nodes the ANN search will visit. The optimal value depends on the desired recall for the vector portion of your search. If your dense vectors are just for a slight semantic boost, a lower targetHits (e.g., 50-100) might be sufficient. If semantic search is the primary discovery mechanism, you may need a higher value (e.g., 500-1000), which will increase latency.
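
    A simple way to choose targetHits is to sweep it over your evaluation queries and record tail latency. The sketch below assumes the same local endpoint, the semantic_search rank profile, and a list of pre-computed query embeddings:

    python
    import time
    import requests

    def p95_latency_ms(target_hits, query_embeddings, endpoint="http://localhost:8080/search/"):
        """Issue one ANN-only query per embedding and return the 95th-percentile latency in ms."""
        yql = (f"select * from product where "
               f"{{targetHits:{target_hits}}}nearestNeighbor(embedding, q_embedding)")
        latencies = []
        for vector in query_embeddings:
            body = {"yql": yql, "ranking": "semantic_search", "input.query(q_embedding)": vector}
            start = time.perf_counter()
            requests.post(endpoint, json=body).raise_for_status()
            latencies.append((time.perf_counter() - start) * 1000)
        latencies.sort()
        return latencies[int(0.95 * (len(latencies) - 1))]

    # for th in (50, 100, 500, 1000):
    #     print(th, round(p95_latency_ms(th, query_embeddings), 1))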

    Memory Management and Quantization

    Dense vector indexes are memory-intensive. A 384-dimensional float vector requires 384 * 4 bytes = 1536 bytes per document. For a catalog of 10 million products, this is over 15 GB of RAM just for the raw vectors, not including the HNSW graph overhead.

    Strategies to manage this:

  • Hardware Scaling: Vespa scales horizontally. You can add more content nodes, and Vespa will automatically distribute the documents and their corresponding vector indexes.
  • Vector Dimensionality: Use a model that produces smaller embeddings (e.g., 128 dimensions instead of 768) if your relevance metrics show it's acceptable.
  • Mixed-Precision & Quantization: Vespa supports bfloat16 tensors, which cut the memory usage of vectors in half with a minimal loss in precision. Modify the schema:

    field embedding type tensor<bfloat16>(d0[384]) { ... }

    This is a simple change that can provide a large memory saving. More advanced techniques like Product Quantization (PQ), which can compress vectors by 10x or more, are on the horizon for many vector database systems, but Vespa's native support for this is still evolving. A quick footprint estimate is sketched below.
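
    As a quick back-of-the-envelope check (raw vector storage only; the HNSW graph and other index structures add further overhead), the footprint difference is easy to estimate:

    python
    def raw_vector_gb(num_docs: int, dims: int, bytes_per_value: int) -> float:
        """Raw tensor storage in GB (decimal), excluding HNSW graph and other index overhead."""
        return num_docs * dims * bytes_per_value / 1e9

    catalog_size = 10_000_000
    print(f"float32 : {raw_vector_gb(catalog_size, 384, 4):.1f} GB")   # ~15.4 GB
    print(f"bfloat16: {raw_vector_gb(catalog_size, 384, 2):.1f} GB")   # ~7.7 GB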


    Section 5: Handling Edge Cases and Advanced Scenarios

    Multi-Vector Representation

    A single embedding for an entire product is often too simplistic. A product's image conveys different information than its text description. A powerful pattern is to use multiple vector fields.

    Example Scenario: Use a text embedding model (like Sentence-BERT) for the title/description and an image embedding model (like CLIP) for the product image.

    Schema Modification:

    sd
    # ... in schema product ...
        field text_embedding type tensor<float>(d0[384]) { ... }
        field image_embedding type tensor<float>(d0[512]) { ... }

    Advanced Rank Profile:

    sd
    # ... in schema product ...
    rank-profile multi_vector_hybrid {
        inputs {
            query(q_text_embedding) tensor<float>(d0[384])
            query(q_image_embedding) tensor<float>(d0[512])
        }
        first-phase {
            expression: bm25(default) + closeness(field, text_embedding) + closeness(field, image_embedding)
        }
        # Normalized fusion of all three signals runs in global-phase (normalize_linear is global-phase only)
        global-phase {
            expression: normalize_linear(bm25(default)) + (1.5 * closeness(field, text_embedding)) + (1.0 * closeness(field, image_embedding))
            rerank-count: 1000
        }
    }

    In this scenario, the query pipeline becomes more complex. If the user provides a text query, you generate a text embedding. If they upload an image for a visual search, you generate an image embedding. You can then pass both to the query and let the rank profile fuse the signals. The weights (1.5 and 1.0) become critical for tuning the relative importance of text vs. image similarity.
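
    A sketch of that query pipeline might look as follows. It assumes the multi_vector_hybrid profile above, a CLIP model exposed through sentence-transformers ('clip-ViT-B-32', which produces 512-dimensional embeddings and can encode PIL images), and a hypothetical user-uploaded reference image:

    python
    import requests
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    text_model = SentenceTransformer('all-MiniLM-L6-v2')   # 384-d text embeddings
    image_model = SentenceTransformer('clip-ViT-B-32')     # 512-d image embeddings

    query_text = "lightweight summer shirt"
    query_image = Image.open("reference_shirt.jpg")        # hypothetical uploaded image

    body = {
        "yql": ("select * from product where (userQuery() "
                "or ({targetHits:100}nearestNeighbor(text_embedding, q_text_embedding)) "
                "or ({targetHits:100}nearestNeighbor(image_embedding, q_image_embedding)))"),
        "query": query_text,
        "ranking": "multi_vector_hybrid",
        "input.query(q_text_embedding)": text_model.encode(query_text).tolist(),
        "input.query(q_image_embedding)": image_model.encode(query_image).tolist(),
    }
    response = requests.post("http://localhost:8080/search/", json=body)
    response.raise_for_status()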

    Real-time Indexing Dynamics

    E-commerce catalogs are dynamic. Prices and stock levels change constantly. A key strength of Vespa is its ability to handle real-time updates without performance degradation.

    When a document is fed to Vespa, it is written to a transaction log and applied directly to the in-memory attribute and index structures, so it becomes searchable almost instantly. The embedding attribute and its HNSW graph live in memory and are updated in place on every write rather than rebuilt, and Vespa periodically flushes these structures to disk so that restarts do not have to replay the entire transaction log. This architecture provides both excellent write throughput and low-latency reads, avoiding the costly full re-indexing required by some other systems.

    Conclusion

    Building a production-grade hybrid search system is a non-trivial engineering task that moves beyond simple API calls to a vector database. It requires a deep understanding of the underlying data structures, a flexible ranking framework, and a robust operations model. By leveraging Vespa, we were able to:

  • Define a unified schema for both sparse and dense data.
  • Implement a custom, tunable ranking function that intelligently fuses BM25 and vector similarity scores, addressing the critical score normalization problem.
  • Construct a query that efficiently retrieves candidates from both indexes using a union of keyword and ANN search results.
  • Explore advanced, production-critical topics like HNSW index tuning, memory optimization with bfloat16, and multi-vector architectures.
    The key takeaway for senior engineers is that the fusion logic, encapsulated in the rank-profile, is where the competitive advantage is built. It's an iterative process of defining ranking expressions, testing them against real-world user queries, and using relevance metrics to tune the weights and formulas until the search experience feels truly intelligent.
