# Optimizing RAG: Advanced Chunking & Metadata Filtering with pg_vector

## The Production RAG Problem: Beyond Naive Implementations
For any senior engineer tasked with building a production-grade Retrieval-Augmented Generation (RAG) system, the initial excitement of a simple proof-of-concept quickly fades. The "naive RAG" pattern—splitting documents into fixed-size, overlapping chunks and performing a simple cosine similarity search—is a fantastic starting point for demos, but it crumbles under the weight of real-world complexity.
Production document corpora are not clean, uniform paragraphs. They are a chaotic mix of API documentation with nested code blocks, financial reports with intricate tables, and internal wikis with complex hierarchies. A naive chunking strategy will inevitably sever critical context: a function signature gets separated from its parameter descriptions, a table header is divorced from its data rows, or a concluding sentence loses its preceding argument.
The result is a system that provides lackluster, irrelevant, or factually incorrect responses. The core problem isn't the Language Model (LLM); it's the retrieval step. Garbage in, garbage out. If you retrieve poor context, you will generate a poor answer.
This article bypasses the basics and dives straight into the two pillars of building a robust, production-ready RAG pipeline:

1. **Advanced chunking strategies** that respect document structure instead of blindly splitting on character counts.
2. **Hybrid retrieval with PostgreSQL's `pg_vector` extension**, combining vector similarity with precise, structured metadata filtering for faster and more accurate retrieval.

Our scenario will be building a RAG system for a large corpus of internal technical documentation, a common and challenging use case where precision and context are paramount.
## Part 1: The Fallacy of Fixed-Size Chunking
Let's start by demonstrating precisely why the most common approach fails. Consider this snippet of Markdown from an API document:
```markdown
### User Authentication Endpoint

The `POST /v2/auth/token` endpoint is used to obtain an API access token.

**Request Body:**

{
  "grant_type": "client_credentials",
  "client_id": "your_client_id",
  "client_secret": "your_client_secret"
}

**Permissions Required:**

- `auth:token:create`

This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
```
Now, let's apply a naive fixed-size chunking strategy (e.g., 100 tokens with a 20-token overlap).
**Chunk 1:**

```
### User Authentication Endpoint
The POST /v2/auth/token endpoint is used to obtain an API access token.
Request Body:
{
"grant_type": "client_credentials",
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}
```
**Chunk 2:**

```
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}
**Permissions Required:**
- `auth:token:create`
This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized
```
**Chunk 3:**

```
per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
```
The problems are immediately obvious:
* **Broken Syntax:** Chunk 2 starts in the middle of a JSON object, creating a syntactically invalid and semantically meaningless piece of text.
* **Orphaned Context:** Chunk 3 discusses the `401` error code, but the critical context of which endpoint it applies to (`/v2/auth/token`) is lost, residing only in Chunk 1.
A query like "What permissions are needed for the auth endpoint?" might correctly retrieve Chunk 2, but the LLM won't have the endpoint name to ground its answer. A query like "What is the rate limit for token generation?" might retrieve Chunks 2 and 3, but the context tying that limit to `/v2/auth/token` lives only in Chunk 1. This is the semantic boundary problem, and solving it is non-negotiable for production quality.
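For reference, the splitter that produces fragments like these is only a few lines. Below is a minimal sketch, using whitespace-separated words as a stand-in for real tokenizer tokens (an assumption made for brevity); the failure mode is identical with a proper tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Naive fixed-size chunking with overlap, using whitespace-delimited words
    as a stand-in for tokenizer tokens. Structure (JSON, tables, headers) is ignored."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # assumes chunk_size > overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks

# Applied to the Markdown above, this happily cuts through the JSON body and
# separates the rate-limit sentence from the endpoint it describes.
```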
## Part 2: Advanced Chunking Strategies & Implementation
To overcome these limitations, we must adopt chunking strategies that respect the document's inherent structure.
### Strategy 1: Recursive Character Text Splitting
This is a step up from fixed-size chunking and is a common default in libraries like LangChain. The algorithm attempts to split text based on a prioritized list of separators. It tries the first separator (e.g., `\n\n` for paragraphs), and if the resulting chunks are still too large, it recursively applies the next separator in the list (e.g., `\n`, then space) to those oversized chunks.
**Python Implementation (Conceptual Logic):**
```python
def recursive_character_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if not separators:
        # Base case: no more separators, so fall back to a hard split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    current_separator = separators[0]
    remaining_separators = separators[1:]

    if current_separator == "":
        # An empty separator means "split anywhere": hard split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    final_chunks = []

    # Split by the highest-priority separator
    splits = text.split(current_separator)
    buffer = ""

    for split in splits:
        if len(buffer) + len(split) + len(current_separator) > chunk_size:
            if buffer:
                # If the buffer itself is too large, it needs further splitting
                if len(buffer) > chunk_size:
                    final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
                else:
                    final_chunks.append(buffer)
            buffer = split
        else:
            if buffer:
                buffer += current_separator + split
            else:
                buffer = split

    if buffer:
        if len(buffer) > chunk_size:
            final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
        else:
            final_chunks.append(buffer)

    return final_chunks


# Usage
markdown_text = "... your markdown document ..."
separators = ["\n\n", "\n", " ", ""]
chunks = recursive_character_split(markdown_text, separators, chunk_size=512)
```
* **Pros:** Simple to implement, better than fixed-size splitting as it tends to respect paragraph boundaries.
* **Cons:** Still fundamentally naive about the *semantic* content. It doesn't understand that a code block or a table is a single, indivisible unit.
### Strategy 2: Semantic Chunking
This approach moves beyond character-based heuristics and uses the embedding model itself to determine chunk boundaries. The core idea is that sentences forming a coherent topic will have similar embeddings. A sharp change in similarity between consecutive sentences likely indicates a topic shift—a perfect place to create a chunk boundary.
**Algorithm:**
1. Split the document into individual sentences.
2. Generate an embedding for each sentence.
3. Calculate the cosine similarity between the embeddings of adjacent sentences (sentence `i` and `i+1`).
4. Identify the points where similarity drops into the lowest few percent of all adjacent-sentence similarities (e.g., below the 5th or 10th percentile). These are the semantic break points.
5. Group sentences between these break points into chunks.
**Python Implementation with NumPy:**
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
# Assumes a sentence tokenizer such as NLTK's (requires the 'punkt' data package)
from nltk.tokenize import sent_tokenize


def semantic_chunker(text: str, model: SentenceTransformer, percentile_threshold: float = 95.0) -> list[str]:
    sentences = sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    # Generate embeddings for all sentences
    embeddings = model.encode(sentences)

    # Calculate cosine similarity between adjacent sentences
    similarities = []
    for i in range(len(sentences) - 1):
        sim = cosine_similarity(embeddings[i].reshape(1, -1), embeddings[i + 1].reshape(1, -1))[0, 0]
        similarities.append(sim)

    # Break where similarity falls into the lowest (100 - percentile_threshold)% of values;
    # e.g. percentile_threshold=95 splits at the 5% sharpest topic shifts
    threshold = np.percentile(similarities, 100.0 - percentile_threshold)
    split_indices = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]

    chunks = []
    start_idx = 0
    for end_idx in split_indices:
        chunks.append(" ".join(sentences[start_idx:end_idx]))
        start_idx = end_idx
    # Add the final chunk
    chunks.append(" ".join(sentences[start_idx:]))
    return chunks


# Usage
document_text = "... your long document with multiple topics ..."
# Use a model suited to semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = semantic_chunker(document_text, model, percentile_threshold=90)
# Now embed these semantically coherent chunks for storage
```
* **Pros:** Creates highly coherent chunks, excellent for prose and documents where topics flow logically.
* **Cons:** Computationally expensive at ingestion time due to embedding every sentence. May not perform as well on structured data like code or tables where sentence structure is less relevant.
### Strategy 3: Hierarchical & Agentic Chunking
For the most complex documents, we can use an LLM as part of the ingestion pipeline itself—an "agentic" approach. This involves creating a hierarchical representation of the document.
**The "Parent Document Retriever" Pattern:**
1. **Initial Split:** Break the document into large, structurally-sound "parent" chunks (e.g., split a Markdown file by `##` H2 headers).
2. **Summarization:** For each parent chunk, use an LLM to generate a concise summary.
3. **Sub-Chunking:** Further split the parent chunks into smaller, more granular pieces suitable for retrieval.
4. **Storage:** Store both the small chunks and the summaries. When performing a search, you retrieve the most relevant small chunks but then provide the LLM with the *entire parent chunk* they belong to, giving it the full context.
**Conceptual Logic:**
```python
# Conceptual sketch: requires an LLM client (e.g., OpenAI) behind call_llm(),
# plus generate_uuid() and the recursive_character_split() defined earlier.
def agentic_hierarchical_chunking(document_text: str) -> list[dict]:
    # Structurally sound "parent" chunks: split the Markdown on H2 headers
    parent_chunks = document_text.split('\n## ')
    processed_data = []

    for parent_chunk in parent_chunks:
        # Use the LLM to generate a one-sentence summary of the parent chunk
        summary_prompt = f"Summarize the following text in one sentence: {parent_chunk[:1000]}"
        summary = call_llm(summary_prompt)

        # Use a simpler method for sub-chunking within the parent
        sub_chunks = recursive_character_split(parent_chunk, ["\n\n", "\n"], chunk_size=256)

        parent_id = generate_uuid()

        # The parent chunk itself: the full context handed to the LLM at answer time
        processed_data.append({
            "id": parent_id,
            "type": "parent_chunk",
            "content": parent_chunk
        })
        # The summary, linked back to its parent
        processed_data.append({
            "id": generate_uuid(),
            "type": "summary",
            "content": summary,
            "parent_id": parent_id
        })
        # The small, retrieval-friendly sub-chunks, each linked to the parent
        for sub_chunk in sub_chunks:
            processed_data.append({
                "id": generate_uuid(),
                "type": "sub_chunk",
                "content": sub_chunk,
                "parent_id": parent_id
            })

    return processed_data
```
This is the most powerful but also the most complex and expensive method. It creates a rich, multi-layered representation of your data that can dramatically improve retrieval for questions that require broad context.
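The retrieval side of the pattern is symmetric: vector search runs over the small `sub_chunk` records, but what reaches the LLM is the `parent_chunk` content they belong to. Below is a minimal sketch, assuming records shaped like those produced by the conceptual function above; the function name is hypothetical, and the in-memory dictionary stands in for what would be a SQL join in production.

```python
def expand_to_parents(retrieved_sub_chunks: list[dict], all_records: list[dict], max_parents: int = 3) -> list[str]:
    """Map the most relevant sub-chunks back to their full parent chunks,
    deduplicating so the same parent is not sent to the LLM twice."""
    parents_by_id = {
        record["id"]: record["content"]
        for record in all_records
        if record["type"] == "parent_chunk"
    }
    seen, contexts = set(), []
    for sub in retrieved_sub_chunks:  # assumed to be sorted by relevance
        parent_id = sub["parent_id"]
        if parent_id not in seen and parent_id in parents_by_id:
            seen.add(parent_id)
            contexts.append(parents_by_id[parent_id])
        if len(contexts) >= max_parents:
            break
    return contexts
```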
---
## Part 3: A Production-Grade PostgreSQL Schema with `pg_vector`
Your database schema is as important as your chunking strategy. A flat table is insufficient. We need to store our chunks, their embeddings, and the rich metadata that will power our hybrid search.
Here is a robust schema designed for our technical documentation use case:
```sql
-- Ensure the pg_vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;

-- Table to store source documents and their high-level metadata
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT NOT NULL UNIQUE,       -- e.g., Confluence URL, Git repo path
    document_type VARCHAR(50) NOT NULL,    -- 'api_docs', 'tutorial', 'architecture_decision_record'
    last_modified TIMESTAMPTZ DEFAULT NOW(),
    metadata JSONB                         -- High-level metadata: author, team, version tags
);

-- Table to store the individual chunks and their embeddings
CREATE TABLE chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_content TEXT NOT NULL,
    -- Assuming OpenAI Ada-002 embeddings (1536 dimensions);
    -- choose the dimension based on your embedding model
    embedding VECTOR(1536) NOT NULL,
    -- Fine-grained metadata specific to this chunk
    chunk_metadata JSONB,                  -- e.g., {'section_title': 'Authentication', 'start_char': 512, 'chunk_level': 2}
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create an index on document_id for efficient joins
CREATE INDEX idx_chunks_document_id ON chunks(document_id);

-- Create GIN indexes on the metadata fields for fast JSONB lookups
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
CREATE INDEX idx_chunks_chunk_metadata ON chunks USING GIN (chunk_metadata);
```
**Schema Rationale:**
* **`documents` Table:** Acts as the canonical source of truth. Storing high-level metadata here (like `version`) prevents data duplication and allows for efficient filtering across all chunks of a document.
* **`chunks` Table:** This is our workhorse. It stores the text, the vector, and a crucial `chunk_metadata` JSONB field.
* **`document_id` Foreign Key:** This relational link is the superpower of using Postgres. It allows us to join chunk-level vector search results with document-level metadata filters.
* **`JSONB` for Metadata:** Provides the flexibility to store arbitrary structured data. We can filter on keys within this JSON, such as `version`, `section_title`, or `tags`.
* **GIN Indexes:** These are critical for making queries on `JSONB` fields performant. Without them, filtering on metadata would require a slow sequential scan. A filter shape these indexes can serve is sketched after this list.
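Below is a minimal sketch of such a metadata-only filter, assuming the `psycopg2` driver; the connection string is illustrative and the metadata keys (`team`, `section_title`) are taken from the schema comments above. Note that the JSONB containment operator `@>` is one of the operators a GIN index on `jsonb` can serve.

```python
import psycopg2  # assumed driver; any PostgreSQL client works the same way

conn = psycopg2.connect("dbname=rag_db")  # connection string is illustrative
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT c.id, c.chunk_content
        FROM chunks c
        JOIN documents d ON c.document_id = d.id
        WHERE d.metadata @> %s::jsonb          -- document-level filter, GIN-indexable
          AND c.chunk_metadata @> %s::jsonb    -- chunk-level filter, GIN-indexable
        """,
        ('{"team": "platform"}', '{"section_title": "Authentication"}'),
    )
    rows = cur.fetchall()
```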
---
## Part 4: Implementing High-Performance Hybrid Search
Now we combine our schema and chunking strategies. Pure vector search is not enough. Consider the query: *"What changed in the v2.5 API regarding billing?"*
A pure vector search might pull semantically similar content from v1.0 or v3.0 of the API docs, leading to incorrect answers. We must filter *before* we search.
### Pre-filtering vs. Post-filtering
* **Post-filtering (Bad):** Fetch the top `k` (e.g., 100) nearest neighbors based on vector similarity *first*, then apply the `WHERE` clause to filter that small result set. This is inefficient and often inaccurate, as the *truly* relevant (but slightly less similar) chunks might not even be in the initial top `k` results.
* **Pre-filtering (Good):** Apply the metadata `WHERE` clause *first* to narrow down the search space, and then perform the vector search only on the filtered subset. This is far more efficient and accurate.
With `pg_vector` and our schema, we can write a single, powerful SQL query that performs pre-filtering.
### The Hybrid Search SQL Query
Let's assume we have a placeholder function `get_embedding_vector(:query_text)` that returns the vector for our query.
```sql
-- Our user's query
-- :query_text = 'What changed in the v2.5 API regarding billing?'
-- :version_filter = '2.5%'
-- :doc_type_filter = 'api_docs'
-- :limit = 10

SELECT
    c.chunk_content,
    c.chunk_metadata,
    d.source_url,
    -- The <=> operator calculates cosine distance (0 = identical, 2 = opposite)
    c.embedding <=> get_embedding_vector(:query_text) AS distance
FROM
    chunks c
JOIN
    documents d ON c.document_id = d.id
WHERE
    -- Metadata filtering happens here, BEFORE the vector search
    d.document_type = :doc_type_filter AND
    (d.metadata->>'version')::text LIKE :version_filter
ORDER BY
    distance
LIMIT
    :limit;
```
**Dissecting the Query:**
1. **`JOIN documents d ON c.document_id = d.id`:** We link chunks to their parent documents.
2. **`WHERE d.document_type = ... AND (d.metadata->>'version')::text LIKE ...`:** This is the pre-filtering magic. The planner restricts the candidate set to only the documents that match our criteria (`api_docs` of version `2.5.x`) before distances are ranked. To keep this filter index-assisted, add a B-tree index on `document_type` and, if you filter extracted JSONB keys with `LIKE`, an expression index on `(metadata->>'version')`; the GIN indexes accelerate JSONB containment operators such as `@>`.
3. **`ORDER BY c.embedding <=> ...`:** The vector similarity search is performed *only on the chunks belonging to the filtered documents*. The `<=>` operator is the cosine distance operator from `pg_vector`.
4. **`LIMIT 10`:** We retrieve the top 10 most relevant chunks from the filtered set.
This single query is vastly superior to a multi-step process. It's atomic, efficient, and leverages decades of query optimization from PostgreSQL.
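In application code there is no `get_embedding_vector()` SQL function; you compute the query embedding client-side and pass it in as a parameter. A minimal sketch follows, assuming `psycopg2` and a `sentence-transformers` model. The model name is an assumption and must match the model used at ingestion, as must the `VECTOR(...)` dimension in the schema (the 384-dimensional MiniLM model here implies a `VECTOR(384)` column rather than the 1536 shown earlier).

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# Assumed embedding model; must match what produced the stored embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim

def hybrid_search(conn, query_text: str, doc_type: str, version_like: str, limit: int = 10):
    # Compute the query embedding client-side and serialize it as a pg_vector literal
    query_vec = model.encode(query_text).tolist()
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT c.chunk_content, c.chunk_metadata, d.source_url,
                   c.embedding <=> %s::vector AS distance
            FROM chunks c
            JOIN documents d ON c.document_id = d.id
            WHERE d.document_type = %s
              AND (d.metadata->>'version') LIKE %s
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, doc_type, version_like, limit),
        )
        return cur.fetchall()

# Example call (connection setup omitted):
# results = hybrid_search(conn, "What changed in the v2.5 API regarding billing?",
#                         "api_docs", "2.5%")
```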
---
## Part 5: Production Performance, Indexing, and Edge Cases
Writing the query is one thing; making it run in milliseconds on millions of vectors is another. This is where indexing becomes non-negotiable.
### Critical `pg_vector` Indexing
`pg_vector` supports Approximate Nearest Neighbor (ANN) search through indexes, which trades perfect accuracy for massive speed improvements. The two main types are IVFFlat and HNSW.
* **HNSW (Hierarchical Navigable Small World):** Generally the recommended choice. It offers excellent query performance with high recall and is robust to data distribution. It has two key parameters:
* `m`: The maximum number of connections per layer (default 16).
* `ef_construction`: The size of the dynamic candidate list during index build (default 64).
**Creating the HNSW Index:**
```sql
-- This can take a long time on a large table!
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```
**Tuning HNSW:**
* Higher `m` and `ef_construction` values lead to a more accurate (higher recall) index that takes longer to build and uses more memory.
* For production, start with the defaults and benchmark. A good starting point for tuning might be `m = 24` and `ef_construction = 128`.
* You can also tune search-time accuracy vs. speed with the `hnsw.ef_search` parameter: `SET hnsw.ef_search = 100;` before running your query. Both knobs appear together in the sketch after this list.
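Putting the two together, a tuned index build plus a session-level recall setting might look like the following sketch. The index name and connection string are illustrative; `CONCURRENTLY` avoids blocking writes during the long build but cannot run inside a transaction, hence the autocommit setting.

```python
import psycopg2

conn = psycopg2.connect("dbname=rag_db")  # connection string is illustrative
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    # Build the HNSW index with the suggested tuning starting points
    cur.execute(
        """
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 24, ef_construction = 128);
        """
    )
    # Session-level trade-off between recall and latency at query time
    cur.execute("SET hnsw.ef_search = 100;")
```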
### Analyzing the Query Plan with `EXPLAIN ANALYZE`
Never guess about performance. Use `EXPLAIN ANALYZE` to see exactly how PostgreSQL is executing your query.
**Without a vector index:**
```
-> Limit  (cost=... rows=10 width=...)
   -> Sort  (cost=... rows=... width=...)
        Sort Key: ((embedding <=> ...))
        -> Nested Loop
             -> Index Scan using ... on documents
             -> Seq Scan on chunks
```
Notice the `Seq Scan on chunks`. The database is brute-force calculating the distance for every single chunk that passes the metadata filter. This will be incredibly slow.
**With an HNSW vector index:**
```
-> Limit  (cost=... rows=10 width=...)
   -> Nested Loop
        -> Index Scan using <hnsw index> on chunks
             Order By: (embedding <=> ...)
        -> Index Scan using ... on documents
```
Now, you see an `Index Scan` using your HNSW index. This is an ANN search, which will be orders of magnitude faster.
### Edge Case: The "Lost in the Middle" Problem
ANN search is great for finding the top `k` results, but what if a key piece of information is buried in the middle of a long document? It might be semantically relevant but not score high enough on cosine similarity to make it into your top 10. The solution is **re-ranking**.
1. **Retrieve Broadly:** Fetch a larger number of candidates from your database, e.g., `k=50`.
2. **Re-rank with a Cross-Encoder:** Use a more powerful, computationally expensive cross-encoder model (such as Cohere Rerank or a fine-tuned BERT cross-encoder). Unlike a bi-encoder (which creates embeddings separately), a cross-encoder takes the query and a document chunk *together* as input and outputs a single relevance score. This allows it to capture much finer-grained nuances.
**Python Re-ranking Logic:**
```python
from sentence_transformers import CrossEncoder

query = "What changed in the v2.5 API regarding billing?"

# 1. Fetch a broad set of candidates (top 50) from Postgres using the hybrid query
#    (execute_hybrid_search_query is a placeholder for your own data-access function)
initial_candidates = execute_hybrid_search_query(query, limit=50)

# 2. Re-rank with a cross-encoder, which scores each (query, chunk) pair jointly
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [[query, candidate['chunk_content']] for candidate in initial_candidates]
scores = cross_encoder.predict(pairs)

# Combine candidates with their new scores and sort by relevance
for i, candidate in enumerate(initial_candidates):
    candidate['rerank_score'] = scores[i]

reranked_results = sorted(initial_candidates, key=lambda x: x['rerank_score'], reverse=True)

# 3. Use the top 5 from the re-ranked list as the final context for the LLM
final_context = reranked_results[:5]
```