Optimizing RAG: Advanced Chunking & Metadata Filtering with pg_vector

Goh Ling Yong

## The Production RAG Problem: Beyond Naive Implementations

For any senior engineer tasked with building a production-grade Retrieval-Augmented Generation (RAG) system, the initial excitement of a simple proof-of-concept quickly fades. The "naive RAG" pattern—splitting documents into fixed-size, overlapping chunks and performing a simple cosine similarity search—is a fantastic starting point for demos, but it crumbles under the weight of real-world complexity.

Production document corpora are not clean, uniform paragraphs. They are a chaotic mix of API documentation with nested code blocks, financial reports with intricate tables, and internal wikis with complex hierarchies. A naive chunking strategy will inevitably sever critical context: a function signature gets separated from its parameter descriptions, a table header is divorced from its data rows, or a concluding sentence loses its preceding argument.

The result is a system that provides lackluster, irrelevant, or factually incorrect responses. The core problem isn't the Language Model (LLM); it's the retrieval step. Garbage in, garbage out. If you retrieve poor context, you will generate a poor answer.

This article bypasses the basics and dives straight into the two pillars of building a robust, production-ready RAG pipeline:

*   **Intelligent, Content-Aware Chunking:** Moving from fixed-size splits to strategies that understand the semantic and structural boundaries of your documents.
*   **Hybrid Search with Rich Metadata Filtering:** Leveraging the full power of a relational database like PostgreSQL with the pg_vector extension to combine vector similarity with precise, structured metadata filtering for faster and more accurate retrieval.

Our scenario will be building a RAG system for a large corpus of internal technical documentation, a common and challenging use case where precision and context are paramount.


## Part 1: The Fallacy of Fixed-Size Chunking

    Let's start by demonstrating precisely why the most common approach fails. Consider this snippet of Markdown from an API document:

````markdown
### User Authentication Endpoint

The `POST /v2/auth/token` endpoint is used to obtain an API access token.

**Request Body:**

```json
{
  "grant_type": "client_credentials",
  "client_id": "your_client_id",
  "client_secret": "your_client_secret"
}
```

**Permissions Required:**

- `auth:token:create`

This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
````

    Now, let's apply a naive fixed-size chunking strategy (e.g., 100 tokens with a 20-token overlap).
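Such a splitter is only a few lines. Here is a minimal sketch, approximating tokens with whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer; `fixed_size_chunks` is a name of our own, not a library function):

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Naive splitter: fixed-size windows of ~chunk_size tokens with overlap."""
    tokens = text.split()  # crude token approximation for illustration
    step = chunk_size - overlap
    return [" ".join(tokens[start:start + chunk_size]) for start in range(0, len(tokens), step)]
```

Applied to the Markdown above, it produces chunks along these lines: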

**Chunk 1:**

````text
### User Authentication Endpoint

The `POST /v2/auth/token` endpoint is used to obtain an API access token.

**Request Body:**

```json
{
  "grant_type": "client_credentials",
  "client_id": "your_client_id",
  "client_secret": "your_client_secret"
}
````

**Chunk 2:**

````text
  "client_id": "your_client_id",
  "client_secret": "your_client_secret"
}
```

**Permissions Required:**

- `auth:token:create`

This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized
````

**Chunk 3:**

```text
per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
```
    
    The problems are immediately obvious:
    
    *   **Broken Syntax:** Chunk 2 starts in the middle of a JSON object, creating a syntactically invalid and semantically meaningless piece of text.
    *   **Orphaned Context:** Chunk 3 discusses the `401` error code, but the critical context of which endpoint it applies to (`/v2/auth/token`) is lost, residing only in Chunk 1.
    
    A query like "What permissions are needed for the auth endpoint?" might correctly retrieve Chunk 2, but the LLM won't have the endpoint name to ground its answer. A query like "What is the rate limit for token generation?" might retrieve Chunk 2 and 3, but the crucial JSON body from Chunk 1 is missing. This is the semantic boundary problem, and solving it is non-negotiable for production quality.
    
    ## Part 2: Advanced Chunking Strategies & Implementation
    
    To overcome these limitations, we must adopt chunking strategies that respect the document's inherent structure.
    
    ### Strategy 1: Recursive Character Text Splitting
    
    This is a step up from fixed-size chunking and is a common default in libraries like LangChain. The algorithm attempts to split text based on a prioritized list of separators. It tries the first separator (e.g., `\n\n` for paragraphs), and if the resulting chunks are still too large, it recursively applies the next separator in the list (e.g., `\n`, then space) to those oversized chunks.
    
    **Python Implementation (Conceptual Logic):**
    

```python
def recursive_character_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if not separators:
        # Base case: if no more separators, just split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    current_separator = separators[0]
    remaining_separators = separators[1:]

    if current_separator == "":
        # An empty separator means character-level splitting (str.split("") would raise)
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    final_chunks = []

    # Split by the highest-priority separator
    splits = text.split(current_separator)

    buffer = ""
    for split in splits:
        if len(buffer) + len(split) + len(current_separator) > chunk_size:
            if buffer:
                # If the buffer itself is too large, it needs further splitting
                if len(buffer) > chunk_size:
                    final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
                else:
                    final_chunks.append(buffer)
            buffer = split
        else:
            if buffer:
                buffer += current_separator + split
            else:
                buffer = split

    if buffer:
        if len(buffer) > chunk_size:
            final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
        else:
            final_chunks.append(buffer)

    return final_chunks


# Usage
markdown_text = "... your markdown document ..."
separators = ["\n\n", "\n", " ", ""]
chunks = recursive_character_split(markdown_text, separators, chunk_size=512)
```
    *   **Pros:** Simple to implement, better than fixed-size splitting as it tends to respect paragraph boundaries.
    *   **Cons:** Still fundamentally naive about the *semantic* content. It doesn't understand that a code block or a table is a single, indivisible unit.
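In practice you rarely hand-roll this; the sketch below shows LangChain's built-in splitter configured with the same separators. Treat the import path as an assumption: it lives in the `langchain-text-splitters` package in recent versions and under `langchain.text_splitter` in older ones.

```python
# Assumes the langchain-text-splitters package (older releases: langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # highest-priority separator first
    chunk_size=512,       # measured in characters by default
    chunk_overlap=50,
)
chunks = splitter.split_text(markdown_text)
```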
    
    ### Strategy 2: Semantic Chunking
    
    This approach moves beyond character-based heuristics and uses the embedding model itself to determine chunk boundaries. The core idea is that sentences forming a coherent topic will have similar embeddings. A sharp change in similarity between consecutive sentences likely indicates a topic shift—a perfect place to create a chunk boundary.
    
    **Algorithm:**
    
    1.  Split the document into individual sentences.
    2.  Generate an embedding for each sentence.
    3.  Calculate the cosine similarity between the embeddings of adjacent sentences (sentence `i` and `i+1`).
4.  Identify points where the similarity between adjacent sentences falls into the lowest few percent of all adjacent-pair similarities (e.g., below the 5th percentile). These are the semantic break points.
    5.  Group sentences between these break points into chunks.
    
    **Python Implementation with NumPy:**
    

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Assumes you have a sentence tokenizer like NLTK (run nltk.download('punkt') once)
from nltk.tokenize import sent_tokenize


def semantic_chunker(text: str, model: SentenceTransformer, percentile_threshold: float = 95.0) -> list[str]:
    sentences = sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    # Generate embeddings for all sentences
    embeddings = model.encode(sentences)

    # Calculate cosine similarity between adjacent sentences
    similarities = []
    for i in range(len(sentences) - 1):
        sim = cosine_similarity(embeddings[i].reshape(1, -1), embeddings[i + 1].reshape(1, -1))[0, 0]
        similarities.append(sim)

    # Split where similarity falls into the lowest (100 - percentile_threshold)% of
    # adjacent-pair similarities, i.e. at the sharpest topic shifts
    threshold = np.percentile(similarities, 100 - percentile_threshold)
    split_indices = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]

    chunks = []
    start_idx = 0
    for end_idx in split_indices:
        chunks.append(" ".join(sentences[start_idx:end_idx]))
        start_idx = end_idx

    # Add the final chunk
    chunks.append(" ".join(sentences[start_idx:]))
    return chunks


# Usage
document_text = "... your long document with multiple topics ..."

# Use a model good for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = semantic_chunker(document_text, model, percentile_threshold=90)

# Now embed these semantically coherent chunks for storage
```
    *   **Pros:** Creates highly coherent chunks, excellent for prose and documents where topics flow logically.
    *   **Cons:** Computationally expensive at ingestion time due to embedding every sentence. May not perform as well on structured data like code or tables where sentence structure is less relevant.
    
    ### Strategy 3: Hierarchical & Agentic Chunking
    
    For the most complex documents, we can use an LLM as part of the ingestion pipeline itself—an "agentic" approach. This involves creating a hierarchical representation of the document.
    
    **The "Parent Document Retriever" Pattern:**
    
    1.  **Initial Split:** Break the document into large, structurally-sound "parent" chunks (e.g., split a Markdown file by `##` H2 headers).
    2.  **Summarization:** For each parent chunk, use an LLM to generate a concise summary.
    3.  **Sub-Chunking:** Further split the parent chunks into smaller, more granular pieces suitable for retrieval.
    4.  **Storage:** Store both the small chunks and the summaries. When performing a search, you retrieve the most relevant small chunks but then provide the LLM with the *entire parent chunk* they belong to, giving it the full context.
    
    **Conceptual Logic:**
    

```python
# This is conceptual and requires an LLM client (e.g., OpenAI);
# call_llm() below is a placeholder for that client.
import uuid


def generate_uuid() -> str:
    return str(uuid.uuid4())


def agentic_hierarchical_chunking(document_text: str) -> list[dict]:
    # Parent chunks: split on H2 headers (the '## ' prefix is consumed by the split)
    parent_chunks = document_text.split('\n## ')
    processed_data = []

    for parent_chunk in parent_chunks:
        # Use an LLM to generate a summary of the parent chunk
        summary_prompt = f"Summarize the following text in one sentence: {parent_chunk[:1000]}"
        summary = call_llm(summary_prompt)

        # Use another method for sub-chunking (defined in Strategy 1)
        sub_chunks = recursive_character_split(parent_chunk, ["\n\n", "\n"], chunk_size=256)

        parent_id = generate_uuid()

        # The summary gets its own id and points back to the parent chunk
        processed_data.append({
            "id": generate_uuid(),
            "type": "summary",
            "content": summary,
            "original_content_ref": parent_id
        })

        processed_data.append({
            "id": parent_id,
            "type": "parent_chunk",
            "content": parent_chunk
        })

        for sub_chunk in sub_chunks:
            processed_data.append({
                "id": generate_uuid(),
                "type": "sub_chunk",
                "content": sub_chunk,
                "parent_id": parent_id
            })

    return processed_data
```
    This is the most powerful but also the most complex and expensive method. It creates a rich, multi-layered representation of your data that can dramatically improve retrieval for questions that require broad context.
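The retrieval side of the pattern is where the payoff comes: you search over the small sub-chunks, then hand the LLM the full parent chunks they belong to. Below is a minimal sketch, assuming the records produced by `agentic_hierarchical_chunking` are available in memory; in production the same lookups would be SQL queries against the schema in Part 3.

```python
def expand_to_parents(hits: list[dict], all_records: list[dict]) -> list[str]:
    """Map retrieved sub-chunk hits back to the full parent chunks they came from."""
    parents_by_id = {
        rec["id"]: rec["content"] for rec in all_records if rec["type"] == "parent_chunk"
    }
    seen_parent_ids: set[str] = set()
    contexts: list[str] = []
    for hit in hits:  # hits = sub_chunk records, already ranked by vector similarity
        parent_id = hit.get("parent_id")
        if parent_id and parent_id not in seen_parent_ids:
            seen_parent_ids.add(parent_id)
            contexts.append(parents_by_id[parent_id])
    return contexts
```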
    
    ---
    
    ## Part 3: A Production-Grade PostgreSQL Schema with `pg_vector`
    
    Your database schema is as important as your chunking strategy. A flat table is insufficient. We need to store our chunks, their embeddings, and the rich metadata that will power our hybrid search.
    
    Here is a robust schema designed for our technical documentation use case:
    

```sql
-- Ensure the pg_vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;

-- Table to store source documents and their high-level metadata
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT NOT NULL UNIQUE, -- e.g., Confluence URL, Git repo path
    document_type VARCHAR(50) NOT NULL, -- 'api_docs', 'tutorial', 'architecture_decision_record'
    last_modified TIMESTAMPTZ DEFAULT NOW(),
    metadata JSONB -- High-level metadata: author, team, version tags
);

-- Table to store the individual chunks and their embeddings
CREATE TABLE chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_content TEXT NOT NULL,
    -- Assuming OpenAI Ada-002 embeddings (1536 dimensions);
    -- choose the dimension based on your embedding model
    embedding VECTOR(1536) NOT NULL,
    -- Fine-grained metadata specific to this chunk
    chunk_metadata JSONB, -- e.g., {'section_title': 'Authentication', 'start_char': 512, 'chunk_level': 2}
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create an index on the document_id for efficient joins
CREATE INDEX idx_chunks_document_id ON chunks(document_id);

-- B-tree index to support the document_type filter used in hybrid search
CREATE INDEX idx_documents_document_type ON documents(document_type);

-- Create a GIN index on the metadata fields for fast JSONB lookups
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
CREATE INDEX idx_chunks_chunk_metadata ON chunks USING GIN (chunk_metadata);
```
    **Schema Rationale:**
    
    *   **`documents` Table:** Acts as the canonical source of truth. Storing high-level metadata here (like `version`) prevents data duplication and allows for efficient filtering across all chunks of a document.
    *   **`chunks` Table:** This is our workhorse. It stores the text, the vector, and a crucial `chunk_metadata` JSONB field.
    *   **`document_id` Foreign Key:** This relational link is the superpower of using Postgres. It allows us to join chunk-level vector search results with document-level metadata filters.
    *   **`JSONB` for Metadata:** Provides the flexibility to store arbitrary structured data. We can filter on keys within this JSON, such as `version`, `section_title`, or `tags`.
    *   **GIN Indexes:** These are critical for making queries on `JSONB` fields performant. Without them, filtering on metadata would require a slow sequential scan.
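To make the schema concrete, here is a hedged ingestion sketch using `psycopg2`. The connection string and the shape of the `chunks` argument are assumptions of this example; the only `pg_vector`-specific detail is that a vector can be passed as a `'[x1,x2,...]'` text literal and cast with `::vector`.

```python
import json
import psycopg2

# conn = psycopg2.connect("postgresql://user:pass@localhost/ragdb")  # placeholder DSN

def ingest_document(conn, source_url: str, document_type: str,
                    doc_metadata: dict, chunks: list[dict]) -> None:
    """Insert one documents row plus its chunks.
    Each chunk dict carries 'content', 'embedding' (list[float]) and 'metadata' (dict)."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO documents (source_url, document_type, metadata)
            VALUES (%s, %s, %s::jsonb)
            RETURNING id
            """,
            (source_url, document_type, json.dumps(doc_metadata)),
        )
        document_id = cur.fetchone()[0]

        for chunk in chunks:
            # pg_vector parses a '[0.1,0.2,...]' text literal into a VECTOR value
            embedding_literal = "[" + ",".join(str(x) for x in chunk["embedding"]) + "]"
            cur.execute(
                """
                INSERT INTO chunks (document_id, chunk_content, embedding, chunk_metadata)
                VALUES (%s, %s, %s::vector, %s::jsonb)
                """,
                (document_id, chunk["content"], embedding_literal, json.dumps(chunk["metadata"])),
            )
    conn.commit()
```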
    
    ---
    
    ## Part 4: Implementing High-Performance Hybrid Search
    
    Now we combine our schema and chunking strategies. Pure vector search is not enough. Consider the query: *"What changed in the v2.5 API regarding billing?"*
    
    A pure vector search might pull semantically similar content from v1.0 or v3.0 of the API docs, leading to incorrect answers. We must filter *before* we search.
    
    ### Pre-filtering vs. Post-filtering
    
    *   **Post-filtering (Bad):** Fetch the top `k` (e.g., 100) nearest neighbors based on vector similarity *first*, then apply the `WHERE` clause to filter that small result set. This is inefficient and often inaccurate, as the *truly* relevant (but slightly less similar) chunks might not even be in the initial top `k` results.
    *   **Pre-filtering (Good):** Apply the metadata `WHERE` clause *first* to narrow down the search space, and then perform the vector search only on the filtered subset. This is far more efficient and accurate.
    
    With `pg_vector` and our schema, we can write a single, powerful SQL query that performs pre-filtering.
    
    ### The Hybrid Search SQL Query
    
Let's assume we have a placeholder function `get_embedding_vector(:query_text)` that returns the vector for our query. In a typical setup there is no such SQL function: you compute the query embedding in application code and bind it as a parameter (the Python wrapper later in this section does exactly that); the placeholder just keeps the SQL readable.
    

```sql
-- Our user's query
-- :query_text = 'What changed in the v2.5 API regarding billing?'
-- :version_filter = '2.5%'
-- :doc_type_filter = 'api_docs'
-- :limit = 10

SELECT
    c.chunk_content,
    c.chunk_metadata,
    d.source_url,
    -- The <=> operator calculates cosine distance (0 = identical, 2 = opposite)
    c.embedding <=> get_embedding_vector(:query_text) AS distance
FROM
    chunks c
JOIN
    documents d ON c.document_id = d.id
WHERE
    -- Metadata filtering happens here, BEFORE the vector search
    d.document_type = :doc_type_filter AND
    (d.metadata->>'version')::text LIKE :version_filter
ORDER BY
    distance
LIMIT
    :limit;
```
    **Dissecting the Query:**
    
    1.  **`JOIN documents d ON c.document_id = d.id`:** We link chunks to their parent documents.
2.  **`WHERE d.document_type = ... AND (d.metadata->>'version')::text LIKE ...`:** This is the pre-filtering magic. The query planner uses the B-tree index on `document_type` to rapidly narrow the candidate set to the matching documents (`api_docs` of version `2.5.x`). Note that the `->>` extraction combined with `LIKE` cannot use the default GIN `jsonb_ops` index; if version filtering becomes the bottleneck, add an expression index on `(metadata->>'version')`.
    3.  **`ORDER BY c.embedding <=> ...`:** The vector similarity search is performed *only on the chunks belonging to the filtered documents*. The `<=>` operator is the cosine distance operator from `pg_vector`.
    4.  **`LIMIT 10`:** We retrieve the top 10 most relevant chunks from the filtered set.
    
    This single query is vastly superior to a multi-step process. It's atomic, efficient, and leverages decades of query optimization from PostgreSQL.
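In application code the whole thing is one parameterized call. Here is a hedged sketch of an `execute_hybrid_search_query` helper (the same name the re-ranking code in Part 5 assumes): the query embedding is computed client-side by a placeholder `embed()` function and bound as a vector literal instead of the `get_embedding_vector` SQL placeholder.

```python
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/ragdb")  # placeholder DSN


def execute_hybrid_search_query(query: str, limit: int = 10,
                                doc_type: str = "api_docs",
                                version_like: str = "2.5%") -> list[dict]:
    """Pre-filter on metadata, then rank the surviving chunks by cosine distance."""
    query_embedding = embed(query)  # placeholder: call your embedding model here
    embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = """
        SELECT c.chunk_content, c.chunk_metadata, d.source_url,
               c.embedding <=> %s::vector AS distance
        FROM chunks c
        JOIN documents d ON c.document_id = d.id
        WHERE d.document_type = %s
          AND (d.metadata->>'version') LIKE %s
        ORDER BY distance
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (embedding_literal, doc_type, version_like, limit))
        rows = cur.fetchall()
    return [
        {"chunk_content": r[0], "chunk_metadata": r[1],
         "source_url": r[2], "distance": r[3]}
        for r in rows
    ]
```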
    
    ---
    
    ## Part 5: Production Performance, Indexing, and Edge Cases
    
    Writing the query is one thing; making it run in milliseconds on millions of vectors is another. This is where indexing becomes non-negotiable.
    
    ### Critical `pg_vector` Indexing
    
    `pg_vector` supports Approximate Nearest Neighbor (ANN) search through indexes, which trades perfect accuracy for massive speed improvements. The two main types are IVFFlat and HNSW.
    
    *   **HNSW (Hierarchical Navigable Small World):** Generally the recommended choice. It offers excellent query performance with high recall and is robust to data distribution. It has two key parameters:
        *   `m`: The maximum number of connections per layer (default 16).
        *   `ef_construction`: The size of the dynamic candidate list during index build (default 64).
    
    **Creating the HNSW Index:**
    

```sql
-- This can take a long time on a large table!
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```
    **Tuning HNSW:**
    
    *   Higher `m` and `ef_construction` values lead to a more accurate (higher recall) index that takes longer to build and uses more memory.
    *   For production, start with the defaults and benchmark. A good starting point for tuning might be `m = 24` and `ef_construction = 128`.
    *   You can also tune search-time accuracy vs. speed with the `hnsw.ef_search` parameter: `SET hnsw.ef_search = 100;` before running your query.
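Pulling the knobs above together, here is a minimal migration-style sketch that builds a tuned HNSW index and shows where the session-level `ef_search` setting fits; the index name and parameter values are illustrative starting points, not recommendations.

```python
import psycopg2


def build_tuned_hnsw_index(dsn: str) -> None:
    """Create an HNSW index with explicit build parameters on chunks.embedding."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Higher m / ef_construction => higher recall, slower build, more memory
        cur.execute(
            """
            CREATE INDEX IF NOT EXISTS idx_chunks_embedding_hnsw
            ON chunks USING hnsw (embedding vector_cosine_ops)
            WITH (m = 24, ef_construction = 128)
            """
        )
        # At query time, run this in the same session as the hybrid search query:
        # cur.execute("SET hnsw.ef_search = 100;")
```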
    
    ### Analyzing the Query Plan with `EXPLAIN ANALYZE`
    
    Never guess about performance. Use `EXPLAIN ANALYZE` to see exactly how PostgreSQL is executing your query.
    
    **Without a vector index:**
    

```text
-> Limit (cost=... rows=10 width=...)
   -> Sort (cost=... rows=... width=...)
      Sort Key: ((embedding <=> ...))
      -> Nested Loop
         -> Index Scan using ... on documents
         -> Seq Scan on chunks
```
    Notice the `Seq Scan on chunks`. The database is brute-force calculating the distance for every single chunk that passes the metadata filter. This will be incredibly slow.
    
    **With an HNSW vector index:**
    

```text
-> Limit (cost=... rows=10 width=...)
   -> Nested Loop
      -> Index Scan using ... on chunks (cost=... rows=...)
         Order By: (embedding <=> ...)
      -> Index Scan using ... on documents
```
    Now, you see an `Index Scan` using your HNSW index. This is an ANN search, which will be orders of magnitude faster.
    
    ### Edge Case: The "Lost in the Middle" Problem
    
    ANN search is great for finding the top `k` results, but what if a key piece of information is buried in the middle of a long document? It might be semantically relevant but not score high enough on cosine similarity to make it into your top 10. The solution is **re-ranking**.
    
    1.  **Retrieve Broadly:** Fetch a larger number of candidates from your database, e.g., `k=50`.
    2.  **Re-rank with a Cross-Encoder:** Use a more powerful, computationally expensive cross-encoder model (like a `Cohere Rerank` or a custom BERT model). Unlike a bi-encoder (which creates embeddings separately), a cross-encoder takes the query and a document chunk *together* as input and outputs a single relevance score. This allows it to capture much finer-grained nuances.
    
    **Python Re-ranking Logic:**
    

```python
from sentence_transformers import CrossEncoder

# 1. Fetch top 50 candidates from Postgres
initial_candidates = execute_hybrid_search_query(query, limit=50)

# 2. Use a cross-encoder to re-rank
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

pairs = [[query, candidate['chunk_content']] for candidate in initial_candidates]
scores = cross_encoder.predict(pairs)

# Combine candidates with the new scores and sort
for i, candidate in enumerate(initial_candidates):
    candidate['rerank_score'] = scores[i]

reranked_results = sorted(initial_candidates, key=lambda x: x['rerank_score'], reverse=True)

# 3. Use the top 5 from the re-ranked list as your final context
final_context = reranked_results[:5]
```
