Production RAG: Hybrid Search with PostgreSQL, pgvector & BM25

Goh Ling Yong

The Achilles' Heel of Pure Semantic Search in RAG

Retrieval-Augmented Generation (RAG) has become a cornerstone for building LLM applications that reason over private or real-time data. The standard architecture—chunking documents, embedding them into vectors, and performing a similarity search to find context for a prompt—is powerful. However, senior engineers deploying these systems into production quickly discover a critical flaw: pure semantic search, typically based on cosine similarity over dense vectors, is often terrible at retrieving information based on specific, literal keywords.

Consider queries involving:

* Product SKUs: ABC-XT-2024

* Error Codes: 0x80070005

* Specific Names or Acronyms: Project Calypso or GDPR

* Technical Identifiers: CVE-2023-4863

An embedding model, trained to capture semantic meaning, might map the identifier CVE-2023-4863 and the phrase "a critical Chrome vulnerability" to similar locations in the vector space. But if the user queries for the exact identifier, a pure vector search might fail to rank the document containing that exact string at the very top, especially if other documents discuss vulnerabilities in a semantically similar way. This is a relevance disaster in production.

Let's illustrate with a concrete, albeit simplified, example. Imagine our knowledge base contains these two text chunks:

  • "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) revealed a buffer overflow vulnerability. The patch was released in Q3."
  • "A critical security flaw was discovered in our primary authentication service, leading to a widespread system compromise. This event highlighted the need for better memory safety protocols."

A user query "details on incident HMDL-2024-01" should unequivocally return chunk #1. However, a standard sentence-transformer model might see strong semantic overlap between the query and chunk #2 (both discuss "security incidents" and "vulnerabilities"). The specific, keyword-like nature of HMDL-2024-01 can be diluted.

    python
    # Simplified demonstration of the problem
    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    chunks = [
        "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) revealed a buffer overflow vulnerability. The patch was released in Q3.",
        "A critical security flaw was discovered in our primary authentication service, leading to a widespread system compromise. This event highlighted the need for better memory safety protocols.",
        "Quarterly budget report for the platform engineering team shows increased spending on monitoring tools."
    ]
    
    query = "details on incident HMDL-2024-01"
    
    # Encode chunks and query
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    
    # Compute cosine similarities
    cosine_scores = util.pytorch_cos_sim(query_embedding, chunk_embeddings)[0]
    
    print("Query:", query)
    for i, score in enumerate(cosine_scores):
        print(f"Score: {score:.4f} - Chunk {i+1}: {chunks[i][:30]}...")
    
    # In many runs, the scores can be surprisingly close, or even wrong.
    # A potential output might be:
    # Query: details on incident HMDL-2024-01
    # Score: 0.6812 - Chunk 1: The investigation into the 'He...
    # Score: 0.6590 - Chunk 2: A critical security flaw was d...
    # Score: 0.1234 - Chunk 3: Quarterly budget report for t...

    While chunk #1 wins here, the margin is slim. In a larger corpus with thousands of documents about security, it's highly plausible that a more general document about incidents could outscore the specific one. The solution is not to abandon semantic search, but to augment it with a robust lexical search mechanism. This is the essence of hybrid search.

    Architecting a Hybrid Search RAG Pipeline with PostgreSQL

    We will build a production-grade hybrid search system using a single, powerful data store: PostgreSQL, supercharged with the pgvector extension. Our architecture will fuse two retrieval methods:

  • Dense Retrieval (Semantic Search): Using pgvector to store and index high-dimensional vectors generated by a sentence-transformer model. This captures the meaning and intent of the query.
  • Sparse Retrieval (Lexical Search): Using PostgreSQL's built-in Full-Text Search (FTS) capabilities. FTS is highly optimized for finding keywords and phrases and can serve as an excellent, practical approximation of classic ranking algorithms like BM25.

    The magic happens not in the database queries themselves, but in how we merge the results from these two disparate systems. We will use Reciprocal Rank Fusion (RRF), a simple and powerful algorithm that combines ranked lists without needing to normalize incomparable scores (like cosine similarity and ts_rank).

    Here's a high-level view of our RAG pipeline:

    Ingestion Pipeline:

  • Document Loader: Reads source documents (Markdown, PDF, etc.).
  • Text Splitter: Chunks documents into manageable, overlapping pieces.
  • Vectorization:
    * Dense Embedder: Creates a 384-dimensional vector for each chunk using a sentence-transformer.
    * Sparse Representation: Creates a tsvector for each chunk using PostgreSQL's FTS functions.
  • Database Storage: Stores the original text, the dense vector, and the sparse tsvector in a PostgreSQL table.

    Retrieval Pipeline:

  • User Query: A natural language question from the user.
  • Parallel Query Generation:
    * The query is passed to the dense embedder to create a query vector.
    * The query is converted into a tsquery for FTS.
  • Dual Database Query: Two queries are executed against PostgreSQL in parallel:
    * A k-NN (k-Nearest Neighbors) search on the vector column.
    * An FTS query on the tsvector column.
  • Fusion & Re-ranking: The two ranked lists of results are merged using the RRF algorithm in our application layer.
  • Context Assembly: The top N re-ranked chunks are formatted into a context block.
  • LLM Prompting: The context and original query are sent to an LLM for final answer generation.

    This architecture leverages the versatility of PostgreSQL, avoiding the operational overhead of managing a separate vector database and a separate search engine like Elasticsearch.

    Deep Dive: Data Ingestion and Indexing

    Let's implement the ingestion pipeline. We'll need a PostgreSQL database with the pgvector extension enabled.

    sql
    -- Ensure the extension is created
    CREATE EXTENSION IF NOT EXISTS vector;

    Database Schema

    Our table needs to store the content, the embedding, and the tsvector representation. We'll also add metadata fields, which are crucial for production systems (e.g., source document ID, chunk sequence).

    sql
    CREATE TABLE document_chunks (
        id SERIAL PRIMARY KEY,
        source_id VARCHAR(255) NOT NULL, -- Identifier for the original document
        chunk_sequence INT NOT NULL,      -- Order of the chunk within the document
        content TEXT NOT NULL,
        embedding VECTOR(384) NOT NULL, -- Assuming all-MiniLM-L6-v2 model (384 dims)
        content_tsv TSVECTOR,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        UNIQUE (source_id, chunk_sequence)
    );
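
    The source_id and chunk_sequence metadata make document updates straightforward in production: when a source document changes, delete its existing chunks and re-ingest it. A minimal sketch using psycopg2 (the connection string placeholder mirrors the one used in the ingestion script below):

    python
    import psycopg2

    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"

    def delete_document(source_id: str) -> int:
        """Removes all chunks for a document so an updated version can be re-ingested."""
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM document_chunks WHERE source_id = %s",
                    (source_id,),
                )
                return cur.rowcount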

    Chunking and Vectorization Strategy

    For the implementation, we'll use Python. Ensure you have the necessary libraries installed:

    bash
    pip install sentence-transformers langchain psycopg2-binary

    Our script will perform the core ingestion logic. A critical choice is the chunking strategy. RecursiveCharacterTextSplitter from LangChain is a robust choice that tries to split on sensible boundaries (paragraphs, sentences, etc.).

    python
    import psycopg2
    from psycopg2.extras import execute_values
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from sentence_transformers import SentenceTransformer
    
    # --- Configuration ---
    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
    MODEL_NAME = 'all-MiniLM-L6-v2'
    
    # --- Initialize components ---
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    model = SentenceTransformer(MODEL_NAME)
    
    def ingest_document(source_id: str, document_text: str):
        """Chunks, embeds, and stores a single document."""
        print(f"Ingesting document: {source_id}")
        
        # 1. Chunk the document
        chunks = text_splitter.split_text(document_text)
        
        # 2. Generate dense embeddings
        embeddings = model.encode(chunks, show_progress_bar=True)
        
        # 3. Prepare data for insertion
        records_to_insert = []
        for i, chunk in enumerate(chunks):
            embedding = embeddings[i].tolist()
            # The content_tsv will be generated by a trigger in the DB for efficiency
            records_to_insert.append((source_id, i, chunk, embedding))
    
        # 4. Insert into database
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    """INSERT INTO document_chunks (source_id, chunk_sequence, content, embedding)
                       VALUES %s""",
                    records_to_insert
                )
        print(f"Successfully ingested {len(chunks)} chunks for document: {source_id}")
    

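    A quick usage sketch, ingesting one of the example chunks from the introduction (the source_id here is purely illustrative):

    python
    if __name__ == "__main__":
        sample_doc = (
            "The investigation into the 'Heimdall' security incident (ref: HMDL-2024-01) "
            "revealed a buffer overflow vulnerability. The patch was released in Q3."
        )
        ingest_document("heimdall_incident_report", sample_doc)
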
    Notice we are not generating the tsvector in the Python code. While possible, it's more efficient and consistent to let PostgreSQL handle it automatically using a trigger. This ensures the tsvector is always in sync with the content.

    sql
    -- Create a function to update the tsvector column
    CREATE OR REPLACE FUNCTION update_document_chunks_tsv() RETURNS TRIGGER AS $$
    BEGIN
        NEW.content_tsv := to_tsvector('english', NEW.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    -- Create a trigger that calls the function before insert or update
    CREATE TRIGGER tsvector_update_trigger
    BEFORE INSERT OR UPDATE ON document_chunks
    FOR EACH ROW EXECUTE FUNCTION update_document_chunks_tsv();

    Now, any INSERT or UPDATE on the document_chunks table will automatically populate the content_tsv column.
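
    One caveat: the trigger only fires on future writes, so any rows inserted before it was created still need a one-time backfill. A minimal sketch using psycopg2 (reusing the DB_CONNECTION_STRING placeholder from the ingestion script):

    python
    import psycopg2

    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"

    # One-time backfill: the trigger only covers rows written after it was created,
    # so populate content_tsv for any older rows that are still NULL.
    with psycopg2.connect(DB_CONNECTION_STRING) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE document_chunks
                   SET content_tsv = to_tsvector('english', content)
                   WHERE content_tsv IS NULL"""
            )
            print(f"Backfilled {cur.rowcount} rows")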

    Advanced Indexing Strategy

    Query performance hinges entirely on proper indexing. We need two different types of indexes for our two query patterns.

  • Index for FTS (Sparse Retrieval): A GIN (Generalized Inverted Index) is the standard and most performant choice for tsvector data.

    sql
    CREATE INDEX idx_gin_document_chunks_tsv ON document_chunks USING GIN(content_tsv);

  • Index for Vector Search (Dense Retrieval): pgvector supports several index types. For production workloads, HNSW (Hierarchical Navigable Small World) is generally the best choice. It offers a fantastic balance of query speed and build time/memory usage, and it provides approximate nearest neighbor (ANN) search, which is exactly what we need.

    sql
    -- HNSW index for vector search (cosine distance)
    -- The choice of m and ef_construction are critical tuning parameters.
    -- m: max number of connections per layer (16-64 is typical)
    -- ef_construction: size of the dynamic candidate list during build (higher improves recall but slows the build)
    CREATE INDEX idx_hnsw_document_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

    A Note on IVFFlat vs. HNSW: pgvector also offers IVFFlat indexes. IVFFlat works by partitioning the vector space into lists. At query time, it only searches a subset of these lists (probes). It can sometimes be faster than HNSW for very large datasets if probes is set to a small value, but this comes at a significant cost to recall. HNSW generally provides better recall-for-speed trade-offs and is easier to tune. For most RAG use cases, HNSW is the superior starting point.

    Implementing the Hybrid Search Query Logic

    With our data ingested and indexed, we can now build the retrieval component. The core challenge is combining the results from a vector search and a text search.

    The Fusion Problem: Incomparable Scores

    * Vector search with vector_cosine_ops returns a cosine distance (0 to 2, where 0 means identical). We typically convert this to a similarity score (1 - distance, as in the query below), which ranges from -1 to 1.

    * PostgreSQL's FTS ts_rank function returns a relevance score, typically in the 0 to 1 range, but its distribution and meaning are completely different from cosine similarity.

    Simply adding these scores is meaningless. Normalizing them (e.g., min-max scaling) is brittle because the scores for a given query depend on the entire result set. This is where Reciprocal Rank Fusion (RRF) shines.

    Reciprocal Rank Fusion (RRF)

    RRF is an elegant algorithm that requires essentially no tuning. It disregards the raw scores and instead uses the rank of each document in the result lists. The formula for the RRF score of a document is:

    RRF_Score(doc) = Σ (1 / (k + rank_i))

    Where:

    * rank_i is the rank of the document in the i-th result list.

    * k is a constant, typically set to a small number like 60, which dampens the influence of lower-ranked items.

    For each document, we sum its reciprocal rank score across all result lists it appears in. Documents that appear high up in multiple lists get a significantly higher final score.
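
    For example, with k = 60, a chunk ranked 1st in the vector list and 3rd in the FTS list scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323, while a chunk that appears in only one list at rank 1 scores just 1/61 ≈ 0.0164, so the doubly-relevant chunk wins comfortably.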

    Code Implementation (Python)

    Let's write a Python function that orchestrates the entire hybrid retrieval process.

    python
    import psycopg2
    from psycopg2.extras import DictCursor
    from sentence_transformers import SentenceTransformer
    
    # --- Re-use configuration from ingestion ---
    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
    MODEL_NAME = 'all-MiniLM-L6-v2'
    model = SentenceTransformer(MODEL_NAME)
    
    def reciprocal_rank_fusion(results: list[list[dict]], k: int = 60) -> list[dict]:
        """Performs RRF on a list of search result lists."""
        fused_scores = {}
    
        # results is a list of lists, e.g., [[doc1, doc2], [doc2, doc3]]
        for result_set in results:
            for rank, doc in enumerate(result_set, start=1):  # ranks are 1-based, matching the RRF formula
                doc_id = doc['id']
                if doc_id not in fused_scores:
                    fused_scores[doc_id] = {'score': 0, 'doc': doc}
                
                fused_scores[doc_id]['score'] += 1 / (k + rank)
    
        # Sort by fused score in descending order
        reranked_results = sorted(fused_scores.values(), key=lambda x: x['score'], reverse=True)
        
        return [item['doc'] for item in reranked_results]
    
    def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
        """Performs hybrid search and returns re-ranked documents."""
        
        # 1. Generate query vectors
        query_embedding = model.encode(query).tolist()
        
        # For FTS, we join the query terms with '&' (AND logic) to build a tsquery.
        # This simple approach can error out on punctuation-heavy input;
        # plainto_tsquery or websearch_to_tsquery are more robust for arbitrary user queries.
        query_fts = " & ".join(query.split())
        
        vector_results = []
        fts_results = []
    
        with psycopg2.connect(DB_CONNECTION_STRING, cursor_factory=DictCursor) as conn:
            with conn.cursor() as cur:
                # 2. Perform Vector Search
                # <=> is the cosine distance operator
                cur.execute(
                    """SELECT id, source_id, content, (1 - (embedding <=> %s)) as similarity
                       FROM document_chunks 
                       ORDER BY embedding <=> %s 
                       LIMIT %s""",
                    (str(query_embedding), str(query_embedding), top_k * 2) # Fetch more to allow for fusion
                )
                vector_results = cur.fetchall()
    
                # 3. Perform Full-Text Search
                cur.execute(
                    """SELECT id, source_id, content, ts_rank(content_tsv, to_tsquery('english', %s)) as rank
                       FROM document_chunks 
                       WHERE content_tsv @@ to_tsquery('english', %s) 
                       ORDER BY rank DESC 
                       LIMIT %s""",
                    (query_fts, query_fts, top_k * 2)
                )
                fts_results = cur.fetchall()
    
        # 4. Fuse the results using RRF
        # Convert psycopg2's DictRow objects to standard dicts
        vector_results_dicts = [dict(row) for row in vector_results]
        fts_results_dicts = [dict(row) for row in fts_results]
        
        reranked_results = reciprocal_rank_fusion([vector_results_dicts, fts_results_dicts])
        
        return reranked_results[:top_k]
    
    # --- Example Usage ---
    if __name__ == '__main__':
        # Assume some documents have been ingested
        # e.g., ingest_document("security_policy_v1", "The policy document HMDL-2024-01 outlines our response...")
        
        search_query = "details on incident HMDL-2024-01"
        results = hybrid_search(search_query)
        
        print(f"Hybrid search results for: '{search_query}'\n")
        for res in results:
            print(f"ID: {res['id']}, Source: {res['source_id']}")
            print(f"Content: {res['content'][:200]}...\n")

    This implementation fetches more results than top_k from each search system (top_k * 2) to provide a richer set of candidates for the fusion algorithm. RRF then intelligently re-ranks them, pushing documents that appear in both lists (e.g., are both semantically and lexically relevant) to the top.
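
    Note that the implementation above issues the two SQL queries sequentially on a single connection. If you want them to run truly in parallel, as described in the architecture overview, one option is a small thread pool with a separate connection per query. A sketch, not a drop-in replacement, assuming the same DB_CONNECTION_STRING as above:

    python
    from concurrent.futures import ThreadPoolExecutor

    import psycopg2
    from psycopg2.extras import DictCursor

    def run_query(sql: str, params: tuple) -> list[dict]:
        """Runs one query on its own connection so the two searches can overlap."""
        with psycopg2.connect(DB_CONNECTION_STRING, cursor_factory=DictCursor) as conn:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return [dict(row) for row in cur.fetchall()]

    def parallel_hybrid_queries(vector_sql, vector_params, fts_sql, fts_params):
        """Executes the vector search and FTS queries concurrently and returns both result lists."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            vector_future = pool.submit(run_query, vector_sql, vector_params)
            fts_future = pool.submit(run_query, fts_sql, fts_params)
            return vector_future.result(), fts_future.result()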

    Production Considerations and Edge Cases

    Building this system is one thing; running it reliably in production is another.

    Performance Tuning `pgvector`

    The HNSW index has query-time parameters that can be set per session or per transaction. The most important is hnsw.ef_search.

    * hnsw.ef_search: Size of the dynamic candidate list during search (40 by default). Higher values give better recall but are slower. You can tune this to trade off speed for accuracy.

    sql
    -- Example of setting it for a session
    SET hnsw.ef_search = 100;
    SELECT id FROM document_chunks ORDER BY embedding <=> '[...]' LIMIT 10;

    Benchmarking is crucial. Create a validation set of (query, expected_document_id) pairs, then write a script that runs queries with different ef_search values and measures both latency and recall@k (i.e., "is the correct document in the top k results?") to find the optimal setting for your application's requirements.
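
    A minimal sketch of such a benchmark, reusing DB_CONNECTION_STRING and the sentence-transformer model from the retrieval code above (the EVAL_SET contents are purely illustrative):

    python
    import time

    import psycopg2

    # Hypothetical validation set: (query, id of the chunk that should be retrieved)
    EVAL_SET = [
        ("details on incident HMDL-2024-01", 1),
        # ... more (query, expected_chunk_id) pairs
    ]

    def benchmark_ef_search(ef_values=(40, 100, 200), k=10):
        """Measures average latency and recall@k for several hnsw.ef_search settings."""
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                for ef in ef_values:
                    cur.execute("SET hnsw.ef_search = %s", (ef,))
                    hits, latencies = 0, []
                    for query, expected_id in EVAL_SET:
                        embedding = model.encode(query).tolist()
                        start = time.perf_counter()
                        cur.execute(
                            """SELECT id FROM document_chunks
                               ORDER BY embedding <=> %s LIMIT %s""",
                            (str(embedding), k),
                        )
                        retrieved_ids = [row[0] for row in cur.fetchall()]
                        latencies.append(time.perf_counter() - start)
                        if expected_id in retrieved_ids:
                            hits += 1
                    recall = hits / len(EVAL_SET)
                    avg_ms = 1000 * sum(latencies) / len(latencies)
                    print(f"ef_search={ef}: recall@{k}={recall:.2f}, avg latency={avg_ms:.1f} ms")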

    Weighted Fusion vs. RRF

    While RRF is powerful because it's parameter-free, you might have use cases where you want to explicitly bias towards either semantic or lexical search. In these cases, you can use a weighted linear combination of normalized scores.

    Hybrid_Score = (1 - α) * normalized_semantic_score + α * normalized_lexical_score

    * α (alpha) is a weighting factor between 0 and 1. α > 0.5 favors lexical search; α < 0.5 favors semantic search.

    The Challenge: Normalization is hard. You can't just use the scores from a single query's result set. You would need to establish a baseline distribution of scores across many queries to perform a more statistically sound normalization (like z-score normalization), which adds significant complexity. For this reason, RRF is the recommended starting point for most production systems.
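
    For completeness, here is a minimal sketch of the weighted approach using naive per-query min-max normalization; as the caveat above explains, this normalization is brittle, and alpha is something you would have to tune on your own validation set. The similarity and rank keys match the column aliases returned by hybrid_search above:

    python
    def min_max_normalize(scores: list[float]) -> list[float]:
        """Scales scores to [0, 1] within a single result set (the brittle part)."""
        if not scores:
            return []
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [1.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]

    def weighted_fusion(semantic: list[dict], lexical: list[dict], alpha: float = 0.5) -> list[dict]:
        """Combines scores as (1 - alpha) * semantic + alpha * lexical."""
        sem_norm = dict(zip([d['id'] for d in semantic],
                            min_max_normalize([d['similarity'] for d in semantic])))
        lex_norm = dict(zip([d['id'] for d in lexical],
                            min_max_normalize([float(d['rank']) for d in lexical])))

        combined = {}
        for doc in semantic + lexical:
            doc_id = doc['id']
            score = (1 - alpha) * sem_norm.get(doc_id, 0.0) + alpha * lex_norm.get(doc_id, 0.0)
            combined[doc_id] = {'score': score, 'doc': doc}

        reranked = sorted(combined.values(), key=lambda x: x['score'], reverse=True)
        return [item['doc'] for item in reranked]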

    The Final Re-ranking Step: Cross-Encoders

    For applications demanding the highest possible precision (e.g., legal or medical Q&A), RRF might not be enough. The ultimate step in relevance ranking is to use a Cross-Encoder.

    Unlike the sentence-transformer (a bi-encoder), which creates embeddings for the query and documents independently, a cross-encoder takes both the query and a candidate document as a single input and outputs a single relevance score (higher means more relevant). This allows it to perform much deeper attention across both texts.

    Cross-encoders are too slow to run on your entire database, but they are perfect for re-ranking the top N (e.g., top 20) candidates returned by your hybrid search.

    python
    from sentence_transformers import CrossEncoder
    
    # This model is specifically trained for re-ranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    def rerank_with_cross_encoder(query: str, documents: list[dict]) -> list[dict]:
        """Re-ranks a list of documents using a CrossEncoder model."""
        # The model expects a list of [query, document_content] pairs
        pairs = [[query, doc['content']] for doc in documents]
        
        # Get scores from the model
        scores = cross_encoder.predict(pairs, show_progress_bar=False)
        
        # Add scores to documents and sort
        for i, doc in enumerate(documents):
            doc['cross_encoder_score'] = scores[i]
            
        return sorted(documents, key=lambda x: x['cross_encoder_score'], reverse=True)
    
    # --- In your main workflow ---
    # 1. Get initial candidates from hybrid search
    candidates = hybrid_search(search_query, top_k=50)
    
    # 2. Re-rank the top candidates for maximum precision
    final_results = rerank_with_cross_encoder(search_query, candidates)
    
    # 3. Use the top 5-10 of these for the LLM context
    top_5_final = final_results[:5]

    This multi-stage process (parallel dense and sparse retrieval -> RRF fusion -> cross-encoder re-ranking) represents the state of the art in RAG retrieval systems.

    Tying It All Together: The Final LLM Prompt

    Once you have your highly-relevant, re-ranked list of document chunks, the final step is to construct the prompt for the LLM.

    python
    import openai
    
    # Assume openai.api_key is configured
    
    def generate_answer(query: str, context_chunks: list[dict]):
        """Generates an answer using the LLM based on the provided context."""
        
        context_str = "\n\n---\n\n".join([chunk['content'] for chunk in context_chunks])
        
        prompt = f"""
        You are an expert Q&A system. Your answer must be grounded in the provided context.
        Do not use any information outside of the context. If the context does not contain
        the answer, state that you cannot answer the question.
    
        CONTEXT:
        {context_str}
    
        QUESTION:
        {query}
    
        ANSWER:
        """
    
        response = openai.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
        )
        
        return response.choices[0].message.content
    
    # --- Final End-to-End Example ---
    search_query = "What was the resolution for incident HMDL-2024-01?"
    
    # 1. Hybrid search to get candidates
    candidates = hybrid_search(search_query, top_k=20)
    
    # 2. (Optional but recommended) Cross-encoder re-ranking
    reranked = rerank_with_cross_encoder(search_query, candidates)
    
    # 3. Select top N for context
    final_context = reranked[:5]
    
    # 4. Generate final answer
    answer = generate_answer(search_query, final_context)
    
    print("--- Generated Answer ---")
    print(answer)

    Conclusion

    By moving from a naive, single-stage vector search to a multi-stage, hybrid retrieval system, we build RAG pipelines that are significantly more robust and production-ready. Combining dense (semantic) and sparse (lexical) search addresses the fundamental weaknesses of each approach, providing a system that understands both conceptual intent and literal keywords.

    Using PostgreSQL with pgvector and its native Full-Text Search offers a powerful, operationally simple foundation for this advanced architecture. By layering techniques like Reciprocal Rank Fusion and optional cross-encoder re-ranking, you can build a retrieval system that delivers highly relevant context to your LLM, dramatically improving the quality and factual accuracy of your application's generated answers.
