Advanced RAG: Hybrid Search via Recursive Embedding & Sparse Vectors

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production RAG Quality Ceiling

If you've built a Retrieval-Augmented Generation (RAG) system beyond a simple demo, you've hit the wall. Your proof-of-concept worked beautifully on clean, well-structured markdown, but in production, it's failing. It hallucinates details, misses obvious answers buried in your documents, and provides frustratingly generic responses. The problem is almost never the Large Language Model (LLM); it's the retrieval. The quality of your generation is fundamentally capped by the quality of the context you provide, and that context is a direct result of your retrieval pipeline.

Basic RAG—characterized by fixed-size chunking and a generic embedding model like OpenAI's text-embedding-ada-002—is a fragile foundation. Fixed-size chunks arbitrarily slice through sentences, tables, and code blocks, destroying the very context you need to retrieve. Generic embedding models, while powerful, lack the domain-specific nuance to understand the difference between a financial projection and a marketing projection if your corpus is dense with internal jargon.

This article is for engineers who have moved past the 'hello world' of RAG and are facing the production challenges of retrieval quality. We will dissect and implement a multi-layered, advanced retrieval strategy that directly addresses the shortcomings of naive approaches. We will focus on three critical pillars:

  • Intelligent Chunking: Moving from arbitrary splits to context-aware segmentation using recursive and semantic strategies.
  • Purpose-Built Embeddings: Understanding when and how to fine-tune embedding models for your specific domain.
  • Hybrid Search Fusion: Combining the semantic power of dense vector search with the keyword precision of sparse vector search (like BM25) using Reciprocal Rank Fusion (RRF).

This is not a theoretical overview. We will write the code, examine the trade-offs, and build a retrieval system architected for the complexity of real-world data.


    Part 1: The Semantic Boundary Problem: Beyond Fixed-Size Chunking

    The most common point of failure in RAG systems is the initial data preparation step: chunking. A fixed-size chunker with overlap is simple to implement but semantically ignorant. It acts as a blunt instrument, creating contextually fragmented artifacts.

    Consider this text snippet:

    The system's primary authentication mechanism is JWT-based. Upon successful login, the /auth/token endpoint returns an access token with a 15-minute expiry and a refresh token valid for 7 days. The refresh token must be stored securely in an HttpOnly cookie. Failure to refresh before the access token expires will result in a 401 Unauthorized error, requiring the user to log in again.

    A fixed-size chunker of 50 words might split this text as follows:

    * Chunk 1: "The system's primary authentication mechanism is JWT-based. Upon successful login, the /auth/token endpoint returns an access token with a 15-minute expiry and a refresh token valid for 7 days. The refresh token must be stored securely in an HttpOnly cookie. Failure to refresh before the access..."

    * Chunk 2: "...token expires will result in a 401 Unauthorized error, requiring the user to log in again."

    If a user asks, "What happens when my JWT expires?", a vector search might retrieve Chunk 2. The LLM would see the 401 Unauthorized error but would have absolutely no context about the refresh token mechanism detailed in Chunk 1. The resulting answer would be incomplete and factually misleading. This is the semantic boundary problem in a nutshell.
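
    To make the failure mode concrete, here is a minimal sketch of such a naive fixed-size word chunker with overlap. The function name and parameters are illustrative, not taken from any particular library:

    python
    from typing import List
    
    def fixed_size_chunk(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
        """Split text into chunks of roughly `chunk_size` words, sharing `overlap` words."""
        words = text.split()
        step = chunk_size - overlap
        chunks = []
        for start in range(0, len(words), step):
            chunk_words = words[start:start + chunk_size]
            if chunk_words:
                chunks.append(" ".join(chunk_words))
        return chunks
    
    # Applied to the JWT snippet above, the split lands mid-sentence, separating the
    # 401 Unauthorized behavior from the refresh-token mechanism it depends on.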

    Strategy 1: Recursive Character Text Splitting

    A significant improvement is to split text hierarchically based on common semantic boundaries. The RecursiveCharacterTextSplitter is a popular implementation of this idea. It attempts to split based on a prioritized list of separators, typically starting with double newlines (paragraphs), then single newlines (lines), then spaces, and finally characters.

    The core algorithm is:

  • Define a list of separators: ["\n\n", "\n", " ", ""].
  • Try to split the entire text by the first separator (\n\n).
  • For each resulting split, if it's still larger than the desired chunk_size, recursively call the splitter on that piece of text using the next separator in the list (\n).
  • This continues until the chunks are smaller than chunk_size or we run out of separators.

    This approach respects paragraph and line breaks, making it far more likely to keep related sentences together.

    python
    # Note: While libraries like LangChain provide this, understanding the implementation is key.
    from typing import List
    
    def recursive_character_split(text: str, chunk_size: int, chunk_overlap: int, separators: List[str] = None) -> List[str]:
        if separators is None:
            separators = ["\n\n", "\n", " ", ""]
    
        final_chunks = []
        
        # Get the appropriate separator
        separator = separators[-1]
        for s in separators:
            if s == "":
                separator = s
                break
            if s in text:
                separator = s
                break
    
        # Split the text by the separator
        if separator:
            splits = text.split(separator)
        else:
            splits = list(text)
    
        # Now, recursively process the splits
        good_splits = []
        _buffer = ""
        for s in splits:
            if len(_buffer) + len(s) > chunk_size:
                # Adding this split would overflow the chunk: flush the buffer and start a new one.
                # A single split longer than chunk_size is handled by the recursion below.
                if len(_buffer) > 0:
                    good_splits.append(_buffer)
                _buffer = s
            else:
                _buffer += (separator + s) if _buffer else s
        
        if _buffer:
            good_splits.append(_buffer)
    
        # Handle overlap and recursive calls
        for chunk in good_splits:
            if len(chunk) > chunk_size:
                # Recurse with the next separator
                next_separators = separators[separators.index(separator) + 1:]
                final_chunks.extend(recursive_character_split(chunk, chunk_size, chunk_overlap, next_separators))
            else:
                final_chunks.append(chunk)
                
        # Note: chunk_overlap is accepted for API parity but not applied here; a production
        # splitter would also carry overlapping text across chunk boundaries.
        # This is a conceptual example focusing on the splitting logic.
        return final_chunks
    
    # Example Usage
    document = "First paragraph.\n\nSecond paragraph which is a bit longer.\nIt has multiple lines.\n\nThird paragraph."
    chunks = recursive_character_split(document, chunk_size=40, chunk_overlap=10)
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}: '{chunk}' (Length: {len(chunk)} )")
    
    # Expected Output:
    # Chunk 1: 'First paragraph.' (Length: 16 )
    # Chunk 2: 'Second paragraph which is a bit longer.' (Length: 39 )
    # Chunk 3: 'It has multiple lines.' (Length: 22 )
    # Chunk 4: 'Third paragraph.' (Length: 16 )

    Production Consideration: The chunk_overlap parameter is critical. It creates a sliding window of context between chunks. A well-chosen overlap (e.g., 10-20% of chunk size) ensures that sentences at the boundary of a split are fully present in at least one of the chunks, mitigating the risk of context fragmentation.
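
    As a minimal illustration of that overlap (a hypothetical post-processing helper, separate from the splitter above), you can prepend the tail of each chunk to the chunk that follows it:

    python
    from typing import List
    
    def add_overlap(chunks: List[str], chunk_overlap: int) -> List[str]:
        """Prepend the last `chunk_overlap` characters of the previous chunk to each chunk."""
        if not chunks:
            return []
        overlapped = [chunks[0]]
        for prev, curr in zip(chunks, chunks[1:]):
            overlapped.append(prev[-chunk_overlap:] + " " + curr)
        return overlapped
    
    # add_overlap(["First paragraph.", "Second paragraph."], chunk_overlap=10)
    # -> ['First paragraph.', 'paragraph. Second paragraph.']
    # A production version would snap the overlap to a word or sentence boundary.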

    Strategy 2: Semantic Chunking with Embedding Thresholds

    Recursive splitting is good, but it's still based on syntactic structure, not semantic meaning. The next evolution is semantic chunking. The core idea is to group sentences based on their conceptual similarity.

    The algorithm is as follows:

    • Split the document into individual sentences.
    • Generate an embedding for each sentence.
    • Iterate through the sentences, comparing the embedding of the current sentence to the next one using cosine similarity.
    • If the similarity is above a certain threshold, the sentences are semantically related and should belong to the same chunk.
    • When the similarity between two consecutive sentences drops below the threshold, it signifies a topic change. This is a semantic boundary, and we should create a new chunk.

    This method is computationally expensive during ingestion but produces incredibly coherent chunks that align with the conceptual flow of the document.

    python
    import numpy as np
    from sentence_transformers import SentenceTransformer
    import re
    
    # Ensure you have the model installed: pip install sentence-transformers
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    def cosine_similarity(v1, v2):
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    
    def semantic_chunker(text: str, similarity_threshold: float = 0.85):
        # A simple sentence splitter
        sentences = re.split(r'(?<=[.!?])\s+', text.replace("\n", " "))
        sentences = [s for s in sentences if s] # Remove empty strings
    
        if not sentences:
            return []
    
        embeddings = model.encode(sentences)
        
        chunks = []
        current_chunk_sentences = [sentences[0]]
        
        for i in range(1, len(sentences)):
            # Compare current sentence with the previous one
            similarity = cosine_similarity(embeddings[i], embeddings[i-1])
            
            if similarity >= similarity_threshold:
                current_chunk_sentences.append(sentences[i])
            else:
                # Similarity dropped, so we end the current chunk
                chunks.append(" ".join(current_chunk_sentences))
                current_chunk_sentences = [sentences[i]]
                
        # Add the last chunk
        chunks.append(" ".join(current_chunk_sentences))
        
        return chunks
    
    # Example with a clear topic shift
    document = "The sun is a star. It provides light and heat to our planet. Photosynthesis in plants depends on sunlight. " \
               "Separately, the moon is Earth's only natural satellite. It controls the tides through its gravitational pull."
    
    chunks = semantic_chunker(document, similarity_threshold=0.5) # Lower threshold for demo
    
    print("--- Semantic Chunks ---")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}: {chunk}")
    
    # Example output (the exact grouping depends on the embedding model and threshold):
    # --- Semantic Chunks ---
    # Chunk 1: The sun is a star. It provides light and heat to our planet. Photosynthesis in plants depends on sunlight.
    # Chunk 2: Separately, the moon is Earth's only natural satellite. It controls the tides through its gravitational pull.

    Performance Edge Case: The choice of similarity_threshold is a hyperparameter you must tune for your specific dataset. A high threshold results in many small, highly-focused chunks. A low threshold creates larger, more general chunks. You may need different thresholds for different document types (e.g., legal contracts vs. technical manuals).
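
    One way to avoid hand-picking a single global threshold is to derive it per document from the distribution of consecutive-sentence similarities, for example by splitting at the weakest links below a chosen percentile. Here is a minimal sketch that reuses the model and cosine_similarity helper defined above; the 25th-percentile default is an assumption you would tune:

    python
    import numpy as np
    
    def adaptive_threshold(embeddings: np.ndarray, percentile: float = 25.0) -> float:
        """Derive a per-document threshold from consecutive-sentence similarities."""
        if len(embeddings) < 2:
            return 0.0
        sims = [
            cosine_similarity(embeddings[i], embeddings[i - 1])
            for i in range(1, len(embeddings))
        ]
        # Split at the weakest ~25% of sentence-to-sentence links
        return float(np.percentile(sims, percentile))
    
    # Hypothetical wiring with the semantic_chunker above:
    # sentences = re.split(r'(?<=[.!?])\s+', document.replace("\n", " "))
    # threshold = adaptive_threshold(model.encode(sentences))
    # chunks = semantic_chunker(document, similarity_threshold=threshold)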


    Part 2: Hybrid Search - Fusing Dense and Sparse Vectors

    Pure vector search excels at finding semantically similar concepts. A query for "how to secure my web server" will find documents about nginx hardening, firewall configuration, and TLS certificates even if those exact keywords aren't used. However, it often fails on queries that require keyword precision.

    Imagine searching a technical knowledge base for an error code like ERR_CONNECTION_TIMED_OUT. A vector search might return documents about general networking issues, latency, or timeouts, but could easily miss the one document that explicitly mentions that exact error string if the surrounding text isn't semantically aligned with the user's query.

    This is where sparse vector retrieval, most commonly implemented with the BM25 algorithm, shines. BM25 is a ranking function based on term frequency (TF) and inverse document frequency (IDF) that excels at matching exact keywords. It's the workhorse behind traditional search engines like Elasticsearch and Lucene.
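
    To get a standalone feel for BM25 before wiring it into a database, here is a small sketch using the rank_bm25 package (pip install rank-bm25); the toy corpus is illustrative:

    python
    from rank_bm25 import BM25Okapi
    
    corpus = [
        "Troubleshooting ERR_CONNECTION_TIMED_OUT behind a corporate proxy.",
        "General strategies for reducing network latency.",
        "Configuring firewall rules for a hardened web server.",
    ]
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    
    bm25 = BM25Okapi(tokenized_corpus)
    
    query_tokens = "err_connection_timed_out".split()
    print(bm25.get_scores(query_tokens))
    # The document containing the exact error string scores highest; the others score ~0.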

    Hybrid search is the practice of running both a dense vector search and a sparse keyword search in parallel and then intelligently fusing the results. This gives you the best of both worlds: semantic understanding and keyword precision.

    The Fusion Algorithm: Reciprocal Rank Fusion (RRF)

    How do you combine two lists of ranked results with completely different scoring systems? You can't just add the scores. The solution is a rank-based fusion algorithm. Reciprocal Rank Fusion (RRF) is a simple, effective, and production-proven method.

    RRF disregards the actual scores and only considers the rank of each document in the result lists. The formula for the RRF score of a document d is:

    RRF_score(d) = Σ (1 / (k + rank_i(d)))

    Where:

    * rank_i(d) is the rank of document d in the i-th result list (e.g., from BM25 or vector search).

    * k is a constant that dampens the influence of top-ranked documents (i.e., those ranked 1st or 2nd) so that no single result list dominates the final score. A common value for k is 60.

    Let's implement it. It's surprisingly straightforward.

    python
    from typing import List
    
    def reciprocal_rank_fusion(results: List[List[str]], k: int = 60) -> dict:
        """
        Performs Reciprocal Rank Fusion on a list of ranked result lists.
    
        Args:
            results: A list where each element is a list of document IDs, ranked by a search system.
            k: The constant used in the RRF formula.
    
        Returns:
            A dictionary of document IDs to their fused RRF scores, sorted in descending order.
        """
        fused_scores = {}
        
        # results = [ ["docA", "docB", "docC"], ["docB", "docD", "docA"] ]
        for doc_list in results:
            for rank, doc_id in enumerate(doc_list, 1):
                if doc_id not in fused_scores:
                    fused_scores[doc_id] = 0
                fused_scores[doc_id] += 1 / (k + rank)
                
        # Sort the results by score in descending order
        reranked_results = {doc_id: score for doc_id, score in sorted(
            fused_scores.items(), key=lambda item: item[1], reverse=True
        )}
        
        return reranked_results
    
    # --- Example Usage ---
    # Let's simulate results from two systems
    
    # Query: "python http timeout error"
    
    # Vector Search (semantic) - finds concepts related to timeouts
    vector_results = ["doc_C", "doc_A", "doc_F"]
    # doc_C: "Handling network latency in Python requests..."
    # doc_A: "Best practices for setting timeouts in Python's HTTPX library..."
    # doc_F: "A guide to resilient microservice communication..."
    
    # BM25 Search (keyword) - finds the exact error string
    bm25_results = ["doc_B", "doc_A", "doc_E"]
    # doc_B: "Troubleshooting 'requests.exceptions.ConnectTimeout' in Python..."
    # doc_A: "Best practices for setting timeouts in Python's HTTPX library..."
    # doc_E: "Common Python error codes and their meanings..."
    
    print("Vector Search Results:", vector_results)
    print("BM25 Search Results:", bm25_results)
    
    fused = reciprocal_rank_fusion([vector_results, bm25_results])
    
    print("\n--- Fused and Re-ranked Results ---")
    for doc_id, score in fused.items():
        print(f"{doc_id}: Score {score:.4f}")
    
    # Expected Output:
    # doc_A gets the highest score because it appeared high in BOTH lists.
    # doc_B and doc_C follow, each having a high rank in one list.
    # --- Fused and Re-ranked Results ---
    # doc_A: Score 0.0323
    # doc_C: Score 0.0164
    # doc_B: Score 0.0164
    # doc_F: Score 0.0159
    # doc_E: Score 0.0159

    Notice how doc_A, which was relevant to both semantic meaning and keywords, was promoted to the top. This is the power of hybrid search.

    Production Architecture with Weaviate

    To implement this in production, you need a database that supports both dense and sparse vector search. Weaviate is an excellent open-source choice that has first-class support for hybrid search.

    Here's a conceptual, runnable example of a full pipeline:

    python
    import weaviate
    import weaviate.classes as wvc
    import os
    
    # This example requires a running Weaviate instance (e.g., via Docker)
    # docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/weaviate/weaviate:latest
    
    # Connect to Weaviate
    # Set the OPENAI_API_KEY environment variable; the header below lets Weaviate
    # call the OpenAI embedding API on your behalf for the text2vec-openai vectorizer.
    client = weaviate.connect_to_local(
        headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
    )
    
    # Define the collection
    collection_name = "TechDocs"
    if client.collections.exists(collection_name):
        client.collections.delete(collection_name)
    
    tech_docs = client.collections.create(
        name=collection_name,
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.config.Configure.Generative.openai()
    )
    
    # Ingest Data
    docs = [
        {"content": "Best practices for setting timeouts in Python's HTTPX library.", "doc_id": "doc_A"},
        {"content": "Troubleshooting 'requests.exceptions.ConnectTimeout' in Python.", "doc_id": "doc_B"},
        {"content": "Handling network latency in Python requests.", "doc_id": "doc_C"},
        {"content": "A deep dive into SQL injection vulnerabilities.", "doc_id": "doc_D"},
        {"content": "Common Python error codes and their meanings.", "doc_id": "doc_E"},
        {"content": "A guide to resilient microservice communication.", "doc_id": "doc_F"}
    ]
    
    # Weaviate automatically creates dense vectors from the 'content' property
    with tech_docs.batch.dynamic() as batch:
        for doc in docs:
            properties = {
                "content": doc["content"],
                "doc_id": doc["doc_id"]
            }
            batch.add_object(properties=properties)
    
    # --- Perform Hybrid Search --- 
    query = "python http timeout error"
    
    response = tech_docs.query.hybrid(
        query=query,
        limit=3,
        # The alpha parameter controls the weighting between sparse (BM25) and dense (vector) search:
        # alpha=1 means pure vector search, alpha=0 means pure BM25 search.
        alpha=0.5,  # A 50/50 balance
        # Request the score metadata explicitly so item.metadata.score is populated below
        return_metadata=wvc.query.MetadataQuery(score=True)
    )
    
    print(f"Hybrid Search Results for query: '{query}'")
    for item in response.objects:
        print(f"  Document ID: {item.properties['doc_id']}")
        print(f"  Content: {item.properties['content']}")
        print(f"  Score: {item.metadata.score:.4f}\n")
    
    client.close()

    When you run this, you'll see that the results are a balanced mix, prioritizing documents that satisfy both the semantic and keyword aspects of the query. The alpha parameter is your key tuning knob for production performance, allowing you to bias towards keyword or semantic search depending on your users' query patterns.
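
    If you have even a handful of labeled query/expected-document pairs, you can sweep alpha empirically rather than guessing. A minimal sketch against the TechDocs collection above, run before client.close(); the labeled pairs here are hypothetical:

    python
    # Hypothetical labeled pairs: query -> doc_id that should appear in the top results
    labeled_queries = {
        "python http timeout error": "doc_A",
        "sql injection vulnerabilities": "doc_D",
    }
    
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        hits = 0
        for q, expected_id in labeled_queries.items():
            resp = tech_docs.query.hybrid(query=q, limit=3, alpha=alpha)
            returned_ids = [obj.properties["doc_id"] for obj in resp.objects]
            hits += expected_id in returned_ids
        print(f"alpha={alpha}: {hits}/{len(labeled_queries)} queries hit in the top 3")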


    Part 3: Advanced Considerations and Edge Cases

    Building a robust retrieval pipeline involves more than just chunking and searching. Here are critical patterns senior engineers must consider.

    Re-ranking with Cross-Encoders

    Your hybrid search with RRF will give you a high-quality list of candidate documents (say, the top 50). Before passing these to the LLM, you can add a final, high-precision re-ranking step. Cross-encoders are perfect for this.

    Unlike the bi-encoders (like SentenceTransformer) used for generating embeddings, which process the query and document independently, a cross-encoder takes both the query and a candidate document as a single input. This allows it to perform a much deeper, more computationally expensive analysis of their relationship, resulting in a highly accurate relevance score.

    The Pattern:

    • Retrieve the top K (e.g., K=50) documents using your fast hybrid search.
    • For each of these 50 documents, pass the pair (query, document_content) to a cross-encoder model.
    • The cross-encoder outputs a single relevance score for the pair (depending on the model, a calibrated probability or a raw logit; higher means more relevant).
    • Sort the 50 documents by this new score.
    • Take the top N (e.g., N=5) re-ranked documents and pass them to the LLM.

    This adds a bit of latency to the query but can dramatically increase the signal-to-noise ratio of the final context, preventing irrelevant but semantically-close documents from polluting the LLM's prompt.

    python
    # pip install sentence-transformers
    from sentence_transformers.cross_encoder import CrossEncoder
    
    # Use a model pre-trained for relevance ranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    query = "python http timeout error"
    candidate_docs_content = [
        "Handling network latency in Python requests.", # From vector search
        "Troubleshooting 'requests.exceptions.ConnectTimeout' in Python.", # From BM25
        "A deep dive into SQL injection vulnerabilities.", # Irrelevant candidate
        "Best practices for setting timeouts in Python's HTTPX library." # From both
    ]
    
    # Create pairs of [query, document]
    model_inputs = [[query, doc] for doc in candidate_docs_content]
    
    # Predict scores
    scores = cross_encoder.predict(model_inputs)
    
    # Pair scores with documents and sort
    ranked_results = sorted(zip(scores, candidate_docs_content), key=lambda pair: pair[0], reverse=True)
    
    print("--- Cross-Encoder Re-ranked Results ---")
    for score, doc in ranked_results:
        print(f"Score: {score:.4f} - {doc}")

    Vector Index Management and Updates

    In a real-world system, your knowledge base is not static. Documents are added, updated, and deleted. Simply adding new vectors is easy, but handling updates and deletes is a classic hard problem in vector databases.

    * Updates: Most vector indexes (like HNSW, the algorithm used by Weaviate) are not designed for in-place modification; you cannot efficiently update a vector once it has been inserted. The standard pattern is delete-then-insert, as sketched below. This can be inefficient and lead to index fragmentation.

    * Deletes: Deleting a vector is also costly. Many systems use a "soft delete" approach where vectors are marked with a tombstone flag and are filtered out at query time. The actual removal happens during periodic index rebuilds.
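
    Here is a minimal sketch of the delete-then-insert pattern with the Weaviate v4 Python client, reusing the TechDocs collection from Part 2. The replacement content is illustrative, and the connection assumes the same OpenAI key header as before:

    python
    import os
    import weaviate
    import weaviate.classes as wvc
    
    client = weaviate.connect_to_local(
        headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
    )
    tech_docs = client.collections.get("TechDocs")
    
    # "Update" doc_A: remove the stale object(s), then insert the new version,
    # which is re-vectorized on ingestion.
    tech_docs.data.delete_many(
        where=wvc.query.Filter.by_property("doc_id").equal("doc_A")
    )
    tech_docs.data.insert(properties={
        "content": "Best practices for setting timeouts and retries in Python's HTTPX library.",
        "doc_id": "doc_A",
    })
    
    client.close()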

    Production Strategy: For systems with high churn, architect your ingestion pipeline for periodic, full re-indexing. For example, you might run a nightly or weekly batch job that rebuilds the entire index from the source of truth. This ensures optimal index health and performance, at the cost of ingestion latency.

    Conclusion: Retrieval is the Bedrock of RAG

    Moving a RAG system from a demo to a reliable production service is an exercise in systems engineering, not just prompt engineering. The LLM is powerful, but it's a garbage-in, garbage-out system. The quality of its output is inextricably linked to the relevance and coherence of the context you retrieve.

    By abandoning naive fixed-size chunking in favor of recursive and semantic splitting, you preserve the contextual integrity of your source data. By implementing hybrid search with RRF, you combine the semantic recall of dense vectors with the keyword precision of sparse vectors, creating a system that is robust to varied user queries. Finally, by adding a cross-encoder re-ranking step, you ensure that only the absolute best context makes it to the final prompt.

    These patterns—intelligent chunking, hybrid fusion, and re-ranking—form the foundation of a production-grade retrieval pipeline. They are the difference between a RAG system that occasionally works and one that is a reliable, accurate, and indispensable tool.
