Production RAG: Hybrid Search with PostgreSQL, pgvector & BM25

Goh Ling Yong

The Achilles' Heel of Pure Semantic Search in RAG

Retrieval-Augmented Generation (RAG) has become a cornerstone for building LLM applications that reason over private or real-time data. The standard architecture—chunking documents, embedding them into vectors, and performing a similarity search to find context for a prompt—is powerful. However, senior engineers deploying these systems into production quickly discover a critical flaw: pure semantic search, typically based on cosine similarity over dense vectors, is often terrible at retrieving information based on specific, literal keywords.

Consider queries involving:

* Product SKUs: ABC-XT-2024

* Error Codes: 0x80070005

* Specific Names or Acronyms: Project Calypso or GDPR

* Technical Identifiers: CVE-2023-4863

An embedding model, trained to capture semantic meaning, might map the identifier CVE-2023-4863 and the phrase "a critical Chrome vulnerability" to similar locations in the vector space. But if the user queries for the exact identifier, a pure vector search might fail to rank the document containing that exact string at the very top, especially if other documents discuss vulnerabilities in a semantically similar way. This is a relevance disaster in production.

Let's illustrate with a concrete, albeit simplified, example. Imagine our knowledge base contains these two text chunks:

  • "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) revealed a buffer overflow vulnerability. The patch was released in Q3."
  • "A critical security flaw was discovered in our primary authentication service, leading to a widespread system compromise. This event highlighted the need for better memory safety protocols."

A user query "details on incident HMDL-2024-01" should unequivocally return chunk #1. However, a standard sentence-transformer model might see strong semantic overlap between the query and chunk #2 (both discuss "security incidents" and "vulnerabilities"). The specific, keyword-like nature of HMDL-2024-01 can be diluted.

    python
    # Simplified demonstration of the problem
    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    chunks = [
        "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) revealed a buffer overflow vulnerability. The patch was released in Q3.",
        "A critical security flaw was discovered in our primary authentication service, leading to a widespread system compromise. This event highlighted the need for better memory safety protocols.",
        "Quarterly budget report for the platform engineering team shows increased spending on monitoring tools."
    ]
    
    query = "details on incident HMDL-2024-01"
    
    # Encode chunks and query
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    
    # Compute cosine similarities
    cosine_scores = util.pytorch_cos_sim(query_embedding, chunk_embeddings)[0]
    
    print("Query:", query)
    for i, score in enumerate(cosine_scores):
        print(f"Score: {score:.4f} - Chunk {i+1}: {chunks[i][:30]}...")
    
    # In many runs, the scores can be surprisingly close, or even wrong.
    # A potential output might be:
    # Query: details on incident HMDL-2024-01
    # Score: 0.6812 - Chunk 1: The investigation into the 'He...
    # Score: 0.6590 - Chunk 2: A critical security flaw was d...
    # Score: 0.1234 - Chunk 3: Quarterly budget report for t...

    While chunk #1 wins here, the margin is slim. In a larger corpus with thousands of documents about security, it's highly plausible that a more general document about incidents could outscore the specific one. The solution is not to abandon semantic search, but to augment it with a robust lexical search mechanism. This is the essence of hybrid search.

    Architecting a Hybrid Search RAG Pipeline with PostgreSQL

    We will build a production-grade hybrid search system using a single, powerful data store: PostgreSQL, supercharged with the pgvector extension. Our architecture will fuse two retrieval methods:

  • Dense Retrieval (Semantic Search): Using pgvector to store and index high-dimensional vectors generated by a sentence-transformer model. This captures the meaning and intent of the query.
  • Sparse Retrieval (Lexical Search): Using PostgreSQL's built-in Full-Text Search (FTS) capabilities. FTS is highly optimized for finding keywords and phrases and can serve as an excellent, practical approximation of classic ranking algorithms like BM25.

    The magic happens not in the database queries themselves, but in how we merge the results from these two disparate systems. We will use Reciprocal Rank Fusion (RRF), a simple and powerful algorithm that combines ranked lists without needing to normalize incomparable scores (like cosine similarity and ts_rank).

    Here's a high-level view of our RAG pipeline:

    Ingestion Pipeline:

  • Document Loader: Reads source documents (Markdown, PDF, etc.).
  • Text Splitter: Chunks documents into manageable, overlapping pieces.
  • Vectorization:
    * Dense Embedder: Creates a 384-dimensional vector for each chunk using a sentence-transformer.
    * Sparse Representation: Creates a tsvector for each chunk using PostgreSQL's FTS functions.
  • Database Storage: Stores the original text, the dense vector, and the sparse tsvector in a PostgreSQL table.

    Retrieval Pipeline:

  • User Query: A natural language question from the user.
  • Parallel Query Generation:
    * The query is passed to the dense embedder to create a query vector.
    * The query is converted into a tsquery for FTS.
  • Dual Database Query: Two queries are executed against PostgreSQL in parallel:
    * A k-NN (k-Nearest Neighbors) search on the vector column.
    * An FTS query on the tsvector column.
  • Fusion & Re-ranking: The two ranked lists of results are merged using the RRF algorithm in our application layer.
  • Context Assembly: The top N re-ranked chunks are formatted into a context block.
  • LLM Prompting: The context and original query are sent to an LLM for final answer generation.

    This architecture leverages the versatility of PostgreSQL, avoiding the operational overhead of managing a separate vector database and a separate search engine like Elasticsearch.

    Deep Dive: Data Ingestion and Indexing

    Let's implement the ingestion pipeline. We'll need a PostgreSQL database with the pgvector extension enabled.

    sql
    -- Ensure the extension is created
    CREATE EXTENSION IF NOT EXISTS vector;

    Database Schema

    Our table needs to store the content, the embedding, and the tsvector representation. We'll also add metadata fields, which are crucial for production systems (e.g., source document ID, chunk sequence).

    sql
    CREATE TABLE document_chunks (
        id SERIAL PRIMARY KEY,
        source_id VARCHAR(255) NOT NULL, -- Identifier for the original document
        chunk_sequence INT NOT NULL,      -- Order of the chunk within the document
        content TEXT NOT NULL,
        embedding VECTOR(384) NOT NULL, -- Assuming all-MiniLM-L6-v2 model (384 dims)
        content_tsv TSVECTOR,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        UNIQUE (source_id, chunk_sequence)
    );
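
    The source_id and chunk_sequence metadata make document updates straightforward in production: when a source document changes, delete its existing chunks and re-ingest it. A minimal sketch using psycopg2 (the connection string placeholder mirrors the one used in the ingestion script below):

    python
    import psycopg2

    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"

    def delete_document(source_id: str) -> int:
        """Removes all chunks for a document so an updated version can be re-ingested."""
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM document_chunks WHERE source_id = %s",
                    (source_id,),
                )
                return cur.rowcount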

    Chunking and Vectorization Strategy

    For the implementation, we'll use Python. Ensure you have the necessary libraries installed:

    bash
    pip install sentence-transformers langchain psycopg2-binary

    Our script will perform the core ingestion logic. A critical choice is the chunking strategy. RecursiveCharacterTextSplitter from LangChain is a robust choice that tries to split on sensible boundaries (paragraphs, sentences, etc.).

    python
    import psycopg2
    from psycopg2.extras import execute_values
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from sentence_transformers import SentenceTransformer
    
    # --- Configuration ---
    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
    MODEL_NAME = 'all-MiniLM-L6-v2'
    
    # --- Initialize components ---
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    model = SentenceTransformer(MODEL_NAME)
    
    def ingest_document(source_id: str, document_text: str):
        """Chunks, embeds, and stores a single document."""
        print(f"Ingesting document: {source_id}")
        
        # 1. Chunk the document
        chunks = text_splitter.split_text(document_text)
        
        # 2. Generate dense embeddings
        embeddings = model.encode(chunks, show_progress_bar=True)
        
        # 3. Prepare data for insertion
        records_to_insert = []
        for i, chunk in enumerate(chunks):
            embedding = embeddings[i].tolist()
            # The content_tsv will be generated by a trigger in the DB for efficiency
            records_to_insert.append((source_id, i, chunk, embedding))
    
        # 4. Insert into database
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    """INSERT INTO document_chunks (source_id, chunk_sequence, content, embedding)
                       VALUES %s""",
                    records_to_insert
                )
        print(f"Successfully ingested {len(chunks)} chunks for document: {source_id}")
    

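    A quick usage sketch, ingesting one of the example chunks from the introduction (the source_id here is purely illustrative):

    python
    if __name__ == "__main__":
        sample_doc = (
            "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) "
            "revealed a buffer overflow vulnerability. The patch was released in Q3."
        )
        ingest_document("heimdall_incident_report", sample_doc)
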
    Notice we are not generating the tsvector in the Python code. While possible, it's more efficient and consistent to let PostgreSQL handle it automatically using a trigger. This ensures the tsvector is always in sync with the content.

    sql
    -- Create a function to update the tsvector column
    CREATE OR REPLACE FUNCTION update_document_chunks_tsv() RETURNS TRIGGER AS $$
    BEGIN
        NEW.content_tsv := to_tsvector('english', NEW.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    -- Create a trigger that calls the function before insert or update
    CREATE TRIGGER tsvector_update_trigger
    BEFORE INSERT OR UPDATE ON document_chunks
    FOR EACH ROW EXECUTE FUNCTION update_document_chunks_tsv();

    Now, any INSERT or UPDATE on the document_chunks table will automatically populate the content_tsv column.
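
    One caveat: the trigger only fires on future writes, so any rows inserted before it was created still need a one-time backfill. A minimal sketch using psycopg2 (reusing the DB_CONNECTION_STRING placeholder from the ingestion script):

    python
    import psycopg2

    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"

    # One-time backfill: the trigger only covers rows written after it was created,
    # so populate content_tsv for any older rows that are still NULL.
    with psycopg2.connect(DB_CONNECTION_STRING) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE document_chunks
                   SET content_tsv = to_tsvector('english', content)
                   WHERE content_tsv IS NULL"""
            )
            print(f"Backfilled {cur.rowcount} rows")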

    Advanced Indexing Strategy

    Query performance hinges entirely on proper indexing. We need two different types of indexes for our two query patterns.

  • Index for FTS (Sparse Retrieval): A GIN (Generalized Inverted Index) is the standard and most performant choice for tsvector data.

    sql
    CREATE INDEX idx_gin_document_chunks_tsv ON document_chunks USING GIN(content_tsv);

  • Index for Vector Search (Dense Retrieval): pgvector supports several index types. For production workloads, HNSW (Hierarchical Navigable Small World) is generally the best choice. It offers a fantastic balance of query speed and build time/memory usage, and it provides approximate nearest neighbor (ANN) search, which is exactly what we need.

    sql
    -- HNSW index for vector search (cosine distance)
    -- The choice of m and ef_construction are critical tuning parameters.
    -- m: max number of connections per layer (16-64 is typical)
    -- ef_construction: size of the dynamic candidate list during build (higher improves recall but slows the build)
    CREATE INDEX idx_hnsw_document_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

    A Note on IVFFlat vs. HNSW: pgvector also offers IVFFlat indexes. IVFFlat works by partitioning the vector space into lists. At query time, it only searches a subset of these lists (probes). It can sometimes be faster than HNSW for very large datasets if probes is set to a small value, but this comes at a significant cost to recall. HNSW generally provides better recall-for-speed trade-offs and is easier to tune. For most RAG use cases, HNSW is the superior starting point.

    Implementing the Hybrid Search Query Logic

    With our data ingested and indexed, we can now build the retrieval component. The core challenge is combining the results from a vector search and a text search.

    The Fusion Problem: Incomparable Scores

    * Vector search with vector_cosine_ops returns a cosine distance (0 to 2, where 0 means identical). We typically convert this to a similarity score (1 - distance, as in the query below), which ranges from -1 to 1.

    * PostgreSQL's FTS ts_rank function returns a relevance score, typically in the 0 to 1 range, but its distribution and meaning are completely different from cosine similarity.

    Simply adding these scores is meaningless. Normalizing them (e.g., min-max scaling) is brittle because the scores for a given query depend on the entire result set. This is where Reciprocal Rank Fusion (RRF) shines.

    Reciprocal Rank Fusion (RRF)

    RRF is an elegant algorithm that requires essentially no tuning. It disregards the raw scores and instead uses the rank of each document in the result lists. The formula for the RRF score of a document is:

    RRF_Score(doc) = Σ (1 / (k + rank_i))

    Where:

    * rank_i is the rank of the document in the i-th result list.

    * k is a constant, typically set to a small number like 60, which dampens the influence of lower-ranked items.

    For each document, we sum its reciprocal rank score across all result lists it appears in. Documents that appear high up in multiple lists get a significantly higher final score.
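
    For example, with k = 60, a chunk ranked 1st in the vector list and 3rd in the FTS list scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323, while a chunk that appears in only one list at rank 1 scores just 1/61 ≈ 0.0164, so the doubly-relevant chunk wins comfortably.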

    Code Implementation (Python)

    Let's write a Python function that orchestrates the entire hybrid retrieval process.

    python
    import psycopg2
    from psycopg2.extras import DictCursor
    from sentence_transformers import SentenceTransformer
    
    # --- Re-use configuration from ingestion ---
    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
    MODEL_NAME = 'all-MiniLM-L6-v2'
    model = SentenceTransformer(MODEL_NAME)
    
    def reciprocal_rank_fusion(results: list[list[dict]], k: int = 60) -> list[dict]:
        """Performs RRF on a list of search result lists."""
        fused_scores = {}
    
        # results is a list of lists, e.g., [[doc1, doc2], [doc2, doc3]]
        for result_set in results:
            for rank, doc in enumerate(result_set, start=1):  # ranks are 1-based, matching the RRF formula
                doc_id = doc['id']
                if doc_id not in fused_scores:
                    fused_scores[doc_id] = {'score': 0, 'doc': doc}
                
                fused_scores[doc_id]['score'] += 1 / (k + rank)
    
        # Sort by fused score in descending order
        reranked_results = sorted(fused_scores.values(), key=lambda x: x['score'], reverse=True)
        
        return [item['doc'] for item in reranked_results]
    
    def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
        """Performs hybrid search and returns re-ranked documents."""
        
        # 1. Generate query vectors
        query_embedding = model.encode(query).tolist()
        
        # For FTS, we join the query terms with '&' (AND logic) to build a tsquery.
        # This simple approach can error out on punctuation-heavy input;
        # plainto_tsquery or websearch_to_tsquery are more robust for arbitrary user queries.
        query_fts = " & ".join(query.split())
        
        vector_results = []
        fts_results = []
    
        with psycopg2.connect(DB_CONNECTION_STRING, cursor_factory=DictCursor) as conn:
            with conn.cursor() as cur:
                # 2. Perform Vector Search
                # <=> is the cosine distance operator
                cur.execute(
                    """SELECT id, source_id, content, (1 - (embedding <=> %s)) as similarity
                       FROM document_chunks 
                       ORDER BY embedding <=> %s 
                       LIMIT %s""",
                    (str(query_embedding), str(query_embedding), top_k * 2) # Fetch more to allow for fusion
                )
                vector_results = cur.fetchall()
    
                # 3. Perform Full-Text Search
                cur.execute(
                    """SELECT id, source_id, content, ts_rank(content_tsv, to_tsquery('english', %s)) as rank
                       FROM document_chunks 
                       WHERE content_tsv @@ to_tsquery('english', %s) 
                       ORDER BY rank DESC 
                       LIMIT %s""",
                    (query_fts, query_fts, top_k * 2)
                )
                fts_results = cur.fetchall()
    
        # 4. Fuse the results using RRF
        # Convert psycopg2's DictRow objects to standard dicts
        vector_results_dicts = [dict(row) for row in vector_results]
        fts_results_dicts = [dict(row) for row in fts_results]
        
        reranked_results = reciprocal_rank_fusion([vector_results_dicts, fts_results_dicts])
        
        return reranked_results[:top_k]
    
    # --- Example Usage ---
    if __name__ == '__main__':
        # Assume some documents have been ingested
        # e.g., ingest_document("security_policy_v1", "The policy document HMDL-2024-01 outlines our response...")
        
        search_query = "details on incident HMDL-2024-01"
        results = hybrid_search(search_query)
        
        print(f"Hybrid search results for: '{search_query}'\n")
        for res in results:
            print(f"ID: {res['id']}, Source: {res['source_id']}")
            print(f"Content: {res['content'][:200]}...\n")

    This implementation fetches more results than top_k from each search system (top_k * 2) to provide a richer set of candidates for the fusion algorithm. RRF then intelligently re-ranks them, pushing documents that appear in both lists (e.g., are both semantically and lexically relevant) to the top.
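
    Note that the implementation above issues the two SQL queries sequentially on a single connection. If you want them to run truly in parallel, as described in the architecture overview, one option is a small thread pool with a separate connection per query. A sketch, not a drop-in replacement, assuming the same DB_CONNECTION_STRING as above:

    python
    from concurrent.futures import ThreadPoolExecutor

    import psycopg2
    from psycopg2.extras import DictCursor

    def run_query(sql: str, params: tuple) -> list[dict]:
        """Runs one query on its own connection so the two searches can overlap."""
        with psycopg2.connect(DB_CONNECTION_STRING, cursor_factory=DictCursor) as conn:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return [dict(row) for row in cur.fetchall()]

    def parallel_hybrid_queries(vector_sql, vector_params, fts_sql, fts_params):
        """Executes the vector search and FTS queries concurrently and returns both result lists."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            vector_future = pool.submit(run_query, vector_sql, vector_params)
            fts_future = pool.submit(run_query, fts_sql, fts_params)
            return vector_future.result(), fts_future.result()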

    Production Considerations and Edge Cases

    Building this system is one thing; running it reliably in production is another.

    Performance Tuning `pgvector`

    The HNSW index has query-time parameters that can be set per session or per transaction. The most important is hnsw.ef_search.

    * hnsw.ef_search: Size of the dynamic candidate list during search (40 by default). Higher values give better recall but are slower. You can tune this to trade off speed for accuracy.

    sql
    -- Example of setting it for a session
    SET hnsw.ef_search = 100;
    SELECT id FROM document_chunks ORDER BY embedding <=> '[...]' LIMIT 10;

    Benchmarking is crucial. Create a validation set of (query, expected_document_id) pairs, then write a script that runs queries with different ef_search values and measures both latency and recall@k (i.e., "is the correct document in the top k results?") to find the optimal setting for your application's requirements.
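
    A minimal sketch of such a benchmark, reusing DB_CONNECTION_STRING and the sentence-transformer model from the retrieval code above (the EVAL_SET contents are purely illustrative):

    python
    import time

    import psycopg2

    # Hypothetical validation set: (query, id of the chunk that should be retrieved)
    EVAL_SET = [
        ("details on incident HMDL-2024-01", 1),
        # ... more (query, expected_chunk_id) pairs
    ]

    def benchmark_ef_search(ef_values=(40, 100, 200), k=10):
        """Measures average latency and recall@k for several hnsw.ef_search settings."""
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                for ef in ef_values:
                    cur.execute("SET hnsw.ef_search = %s", (ef,))
                    hits, latencies = 0, []
                    for query, expected_id in EVAL_SET:
                        embedding = model.encode(query).tolist()
                        start = time.perf_counter()
                        cur.execute(
                            """SELECT id FROM document_chunks
                               ORDER BY embedding <=> %s LIMIT %s""",
                            (str(embedding), k),
                        )
                        retrieved_ids = [row[0] for row in cur.fetchall()]
                        latencies.append(time.perf_counter() - start)
                        if expected_id in retrieved_ids:
                            hits += 1
                    recall = hits / len(EVAL_SET)
                    avg_ms = 1000 * sum(latencies) / len(latencies)
                    print(f"ef_search={ef}: recall@{k}={recall:.2f}, avg latency={avg_ms:.1f} ms")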

    Weighted Fusion vs. RRF

    While RRF is powerful because it's parameter-free, you might have use cases where you want to explicitly bias towards either semantic or lexical search. In these cases, you can use a weighted linear combination of normalized scores.

    Hybrid_Score = (1 - α) * normalized_semantic_score + α * normalized_lexical_score

    * α (alpha) is a weighting factor between 0 and 1. α > 0.5 favors lexical search; α < 0.5 favors semantic search.

    The Challenge: Normalization is hard. You can't just use the scores from a single query's result set. You would need to establish a baseline distribution of scores across many queries to perform a more statistically sound normalization (like z-score normalization), which adds significant complexity. For this reason, RRF is the recommended starting point for most production systems.
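
    For completeness, here is a minimal sketch of the weighted approach using naive per-query min-max normalization; as the caveat above explains, this normalization is brittle, and alpha is something you would have to tune on your own validation set. The similarity and rank keys match the column aliases returned by hybrid_search above:

    python
    def min_max_normalize(scores: list[float]) -> list[float]:
        """Scales scores to [0, 1] within a single result set (the brittle part)."""
        if not scores:
            return []
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [1.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]

    def weighted_fusion(semantic: list[dict], lexical: list[dict], alpha: float = 0.5) -> list[dict]:
        """Combines scores as (1 - alpha) * semantic + alpha * lexical."""
        sem_norm = dict(zip([d['id'] for d in semantic],
                            min_max_normalize([d['similarity'] for d in semantic])))
        lex_norm = dict(zip([d['id'] for d in lexical],
                            min_max_normalize([float(d['rank']) for d in lexical])))

        combined = {}
        for doc in semantic + lexical:
            doc_id = doc['id']
            score = (1 - alpha) * sem_norm.get(doc_id, 0.0) + alpha * lex_norm.get(doc_id, 0.0)
            combined[doc_id] = {'score': score, 'doc': doc}

        reranked = sorted(combined.values(), key=lambda x: x['score'], reverse=True)
        return [item['doc'] for item in reranked]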

    The Final Re-ranking Step: Cross-Encoders

    For applications demanding the highest possible precision (e.g., legal or medical Q&A), RRF might not be enough. The ultimate step in relevance ranking is to use a Cross-Encoder.

    Unlike the sentence-transformer (a bi-encoder), which creates embeddings for the query and documents independently, a cross-encoder takes both the query and a candidate document as a single input and outputs a single relevance score (higher means more relevant). This allows it to perform much deeper attention across both texts.

    Cross-encoders are too slow to run on your entire database, but they are perfect for re-ranking the top N (e.g., top 20) candidates returned by your hybrid search.

    python
    from sentence_transformers import CrossEncoder
    
    # This model is specifically trained for re-ranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    def rerank_with_cross_encoder(query: str, documents: list[dict]) -> list[dict]:
        """Re-ranks a list of documents using a CrossEncoder model."""
        # The model expects a list of [query, document_content] pairs
        pairs = [[query, doc['content']] for doc in documents]
        
        # Get scores from the model
        scores = cross_encoder.predict(pairs, show_progress_bar=False)
        
        # Add scores to documents and sort
        for i, doc in enumerate(documents):
            doc['cross_encoder_score'] = scores[i]
            
        return sorted(documents, key=lambda x: x['cross_encoder_score'], reverse=True)
    
    # --- In your main workflow ---
    # 1. Get initial candidates from hybrid search
    candidates = hybrid_search(search_query, top_k=50)
    
    # 2. Re-rank the top candidates for maximum precision
    final_results = rerank_with_cross_encoder(search_query, candidates)
    
    # 3. Use the top 5-10 of these for the LLM context
    top_5_final = final_results[:5]

    This multi-stage process (parallel dense and sparse retrieval -> RRF fusion -> cross-encoder re-ranking) represents the state of the art in RAG retrieval systems.

    Tying It All Together: The Final LLM Prompt

    Once you have your highly-relevant, re-ranked list of document chunks, the final step is to construct the prompt for the LLM.

    python
    import openai
    
    # Assume openai.api_key is configured
    
    def generate_answer(query: str, context_chunks: list[dict]):
        """Generates an answer using the LLM based on the provided context."""
        
        context_str = "\n\n---\n\n".join([chunk['content'] for chunk in context_chunks])
        
        prompt = f"""
        You are an expert Q&A system. Your answer must be grounded in the provided context.
        Do not use any information outside of the context. If the context does not contain
        the answer, state that you cannot answer the question.
    
        CONTEXT:
        {context_str}
    
        QUESTION:
        {query}
    
        ANSWER:
        """
    
        response = openai.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
        )
        
        return response.choices[0].message.content
    
    # --- Final End-to-End Example ---
    search_query = "What was the resolution for incident HMDL-2024-01?"
    
    # 1. Hybrid search to get candidates
    candidates = hybrid_search(search_query, top_k=20)
    
    # 2. (Optional but recommended) Cross-encoder re-ranking
    reranked = rerank_with_cross_encoder(search_query, candidates)
    
    # 3. Select top N for context
    final_context = reranked[:5]
    
    # 4. Generate final answer
    answer = generate_answer(search_query, final_context)
    
    print("--- Generated Answer ---")
    print(answer)

    Conclusion

    By moving from a naive, single-stage vector search to a multi-stage, hybrid retrieval system, we build RAG pipelines that are significantly more robust and production-ready. Combining dense (semantic) and sparse (lexical) search addresses the fundamental weaknesses of each approach, providing a system that understands both conceptual intent and literal keywords.

    Using PostgreSQL with pgvector and its native Full-Text Search offers a powerful, operationally simple foundation for this advanced architecture. By layering techniques like Reciprocal Rank Fusion and optional cross-encoder re-ranking, you can build a retrieval system that delivers highly relevant context to your LLM, dramatically improving the quality and factual accuracy of your application's generated answers.
