# Optimizing RAG: Advanced Chunking & Metadata Filtering with pg_vector

## The Production RAG Problem: Beyond Naive Implementations
For any senior engineer tasked with building a production-grade Retrieval-Augmented Generation (RAG) system, the initial excitement of a simple proof-of-concept quickly fades. The "naive RAG" pattern—splitting documents into fixed-size, overlapping chunks and performing a simple cosine similarity search—is a fantastic starting point for demos, but it crumbles under the weight of real-world complexity.
Production document corpora are not clean, uniform paragraphs. They are a chaotic mix of API documentation with nested code blocks, financial reports with intricate tables, and internal wikis with complex hierarchies. A naive chunking strategy will inevitably sever critical context: a function signature gets separated from its parameter descriptions, a table header is divorced from its data rows, or a concluding sentence loses its preceding argument.
The result is a system that provides lackluster, irrelevant, or factually incorrect responses. The core problem isn't the Language Model (LLM); it's the retrieval step. Garbage in, garbage out. If you retrieve poor context, you will generate a poor answer.
This article bypasses the basics and dives straight into the two pillars of building a robust, production-ready RAG pipeline:

1. **Advanced chunking strategies** that respect document structure instead of blindly splitting on character counts.
2. **Hybrid retrieval with PostgreSQL's `pg_vector` extension**, combining vector similarity with precise, structured metadata filtering for faster and more accurate retrieval.

Our scenario will be building a RAG system for a large corpus of internal technical documentation, a common and challenging use case where precision and context are paramount.
## Part 1: The Fallacy of Fixed-Size Chunking
Let's start by demonstrating precisely why the most common approach fails. Consider this snippet of Markdown from an API document:
```markdown
### User Authentication Endpoint

The `POST /v2/auth/token` endpoint is used to obtain an API access token.

**Request Body:**

{
  "grant_type": "client_credentials",
  "client_id": "your_client_id",
  "client_secret": "your_client_secret"
}

**Permissions Required:**

- `auth:token:create`

This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
```
Now, let's apply a naive fixed-size chunking strategy (e.g., 100 tokens with a 20-token overlap).
**Chunk 1:**

```
### User Authentication Endpoint
The POST /v2/auth/token endpoint is used to obtain an API access token.
Request Body:
{
"grant_type": "client_credentials",
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}
```
**Chunk 2:**

```
"client_id": "your_client_id",
"client_secret": "your_client_secret"
}
**Permissions Required:**
- `auth:token:create`
This endpoint is rate-limited to 100 requests per minute per IP address. Unauthorized
```
**Chunk 3:**

```
per minute per IP address. Unauthorized requests will receive a `401 Unauthorized` status code.
```
The problems are immediately obvious:
* **Broken Syntax:** Chunk 2 starts in the middle of a JSON object, creating a syntactically invalid and semantically meaningless piece of text.
* **Orphaned Context:** Chunk 3 discusses the `401` error code, but the critical context of which endpoint it applies to (`/v2/auth/token`) is lost, residing only in Chunk 1.
A query like "What permissions are needed for the auth endpoint?" might correctly retrieve Chunk 2, but the LLM won't have the endpoint name to ground its answer. A query like "What is the rate limit for token generation?" might retrieve Chunks 2 and 3, but the context tying that limit to `/v2/auth/token` lives only in Chunk 1. This is the semantic boundary problem, and solving it is non-negotiable for production quality.
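For reference, the splitter that produces fragments like these is only a few lines. Below is a minimal sketch, using whitespace-separated words as a stand-in for real tokenizer tokens (an assumption made for brevity); the failure mode is identical with a proper tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Naive fixed-size chunking with overlap, using whitespace-delimited words
    as a stand-in for tokenizer tokens. Structure (JSON, tables, headers) is ignored."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # assumes chunk_size > overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks

# Applied to the Markdown above, this happily cuts through the JSON body and
# separates the rate-limit sentence from the endpoint it describes.
```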
## Part 2: Advanced Chunking Strategies & Implementation
To overcome these limitations, we must adopt chunking strategies that respect the document's inherent structure.
### Strategy 1: Recursive Character Text Splitting
This is a step up from fixed-size chunking and is a common default in libraries like LangChain. The algorithm attempts to split text based on a prioritized list of separators. It tries the first separator (e.g., `\n\n` for paragraphs), and if the resulting chunks are still too large, it recursively applies the next separator in the list (e.g., `\n`, then space) to those oversized chunks.
**Python Implementation (Conceptual Logic):**
```python
def recursive_character_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if not separators:
        # Base case: no more separators, so fall back to a hard split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    current_separator = separators[0]
    remaining_separators = separators[1:]

    if current_separator == "":
        # An empty separator means "split anywhere": hard split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    final_chunks = []

    # Split by the highest-priority separator
    splits = text.split(current_separator)
    buffer = ""

    for split in splits:
        if len(buffer) + len(split) + len(current_separator) > chunk_size:
            if buffer:
                # If the buffer itself is too large, it needs further splitting
                if len(buffer) > chunk_size:
                    final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
                else:
                    final_chunks.append(buffer)
            buffer = split
        else:
            if buffer:
                buffer += current_separator + split
            else:
                buffer = split

    if buffer:
        if len(buffer) > chunk_size:
            final_chunks.extend(recursive_character_split(buffer, remaining_separators, chunk_size))
        else:
            final_chunks.append(buffer)

    return final_chunks


# Usage
markdown_text = "... your markdown document ..."
separators = ["\n\n", "\n", " ", ""]
chunks = recursive_character_split(markdown_text, separators, chunk_size=512)
```
* **Pros:** Simple to implement, better than fixed-size splitting as it tends to respect paragraph boundaries.
* **Cons:** Still fundamentally naive about the *semantic* content. It doesn't understand that a code block or a table is a single, indivisible unit.
### Strategy 2: Semantic Chunking
This approach moves beyond character-based heuristics and uses the embedding model itself to determine chunk boundaries. The core idea is that sentences forming a coherent topic will have similar embeddings. A sharp change in similarity between consecutive sentences likely indicates a topic shift—a perfect place to create a chunk boundary.
**Algorithm:**
1. Split the document into individual sentences.
2. Generate an embedding for each sentence.
3. Calculate the cosine similarity between the embeddings of adjacent sentences (sentence `i` and `i+1`).
4. Identify the points where similarity drops into the lowest few percent of all adjacent-sentence similarities (e.g., below the 5th or 10th percentile). These are the semantic break points.
5. Group sentences between these break points into chunks.
**Python Implementation with NumPy:**
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
# Assumes a sentence tokenizer such as NLTK's (requires the 'punkt' data package)
from nltk.tokenize import sent_tokenize


def semantic_chunker(text: str, model: SentenceTransformer, percentile_threshold: float = 95.0) -> list[str]:
    sentences = sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    # Generate embeddings for all sentences
    embeddings = model.encode(sentences)

    # Calculate cosine similarity between adjacent sentences
    similarities = []
    for i in range(len(sentences) - 1):
        sim = cosine_similarity(embeddings[i].reshape(1, -1), embeddings[i + 1].reshape(1, -1))[0, 0]
        similarities.append(sim)

    # Break where similarity falls into the lowest (100 - percentile_threshold)% of values;
    # e.g. percentile_threshold=95 splits at the 5% sharpest topic shifts
    threshold = np.percentile(similarities, 100.0 - percentile_threshold)
    split_indices = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]

    chunks = []
    start_idx = 0
    for end_idx in split_indices:
        chunks.append(" ".join(sentences[start_idx:end_idx]))
        start_idx = end_idx
    # Add the final chunk
    chunks.append(" ".join(sentences[start_idx:]))
    return chunks


# Usage
document_text = "... your long document with multiple topics ..."
# Use a model suited to semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = semantic_chunker(document_text, model, percentile_threshold=90)
# Now embed these semantically coherent chunks for storage
```
* **Pros:** Creates highly coherent chunks, excellent for prose and documents where topics flow logically.
* **Cons:** Computationally expensive at ingestion time due to embedding every sentence. May not perform as well on structured data like code or tables where sentence structure is less relevant.
### Strategy 3: Hierarchical & Agentic Chunking
For the most complex documents, we can use an LLM as part of the ingestion pipeline itself—an "agentic" approach. This involves creating a hierarchical representation of the document.
**The "Parent Document Retriever" Pattern:**
1. **Initial Split:** Break the document into large, structurally-sound "parent" chunks (e.g., split a Markdown file by `##` H2 headers).
2. **Summarization:** For each parent chunk, use an LLM to generate a concise summary.
3. **Sub-Chunking:** Further split the parent chunks into smaller, more granular pieces suitable for retrieval.
4. **Storage:** Store both the small chunks and the summaries. When performing a search, you retrieve the most relevant small chunks but then provide the LLM with the *entire parent chunk* they belong to, giving it the full context.
**Conceptual Logic:**
```python
# Conceptual sketch: requires an LLM client (e.g., OpenAI) behind call_llm(),
# plus generate_uuid() and the recursive_character_split() defined earlier.
def agentic_hierarchical_chunking(document_text: str) -> list[dict]:
    # Structurally sound "parent" chunks: split the Markdown on H2 headers
    parent_chunks = document_text.split('\n## ')
    processed_data = []

    for parent_chunk in parent_chunks:
        # Use the LLM to generate a one-sentence summary of the parent chunk
        summary_prompt = f"Summarize the following text in one sentence: {parent_chunk[:1000]}"
        summary = call_llm(summary_prompt)

        # Use a simpler method for sub-chunking within the parent
        sub_chunks = recursive_character_split(parent_chunk, ["\n\n", "\n"], chunk_size=256)

        parent_id = generate_uuid()

        # The parent chunk itself: the full context handed to the LLM at answer time
        processed_data.append({
            "id": parent_id,
            "type": "parent_chunk",
            "content": parent_chunk
        })
        # The summary, linked back to its parent
        processed_data.append({
            "id": generate_uuid(),
            "type": "summary",
            "content": summary,
            "parent_id": parent_id
        })
        # The small, retrieval-friendly sub-chunks, each linked to the parent
        for sub_chunk in sub_chunks:
            processed_data.append({
                "id": generate_uuid(),
                "type": "sub_chunk",
                "content": sub_chunk,
                "parent_id": parent_id
            })

    return processed_data
```
This is the most powerful but also the most complex and expensive method. It creates a rich, multi-layered representation of your data that can dramatically improve retrieval for questions that require broad context.
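The retrieval side of the pattern is symmetric: vector search runs over the small `sub_chunk` records, but what reaches the LLM is the `parent_chunk` content they belong to. Below is a minimal sketch, assuming records shaped like those produced by the conceptual function above; the function name is hypothetical, and the in-memory dictionary stands in for what would be a SQL join in production.

```python
def expand_to_parents(retrieved_sub_chunks: list[dict], all_records: list[dict], max_parents: int = 3) -> list[str]:
    """Map the most relevant sub-chunks back to their full parent chunks,
    deduplicating so the same parent is not sent to the LLM twice."""
    parents_by_id = {
        record["id"]: record["content"]
        for record in all_records
        if record["type"] == "parent_chunk"
    }
    seen, contexts = set(), []
    for sub in retrieved_sub_chunks:  # assumed to be sorted by relevance
        parent_id = sub["parent_id"]
        if parent_id not in seen and parent_id in parents_by_id:
            seen.add(parent_id)
            contexts.append(parents_by_id[parent_id])
        if len(contexts) >= max_parents:
            break
    return contexts
```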
---
## Part 3: A Production-Grade PostgreSQL Schema with `pg_vector`
Your database schema is as important as your chunking strategy. A flat table is insufficient. We need to store our chunks, their embeddings, and the rich metadata that will power our hybrid search.
Here is a robust schema designed for our technical documentation use case:
```sql
-- Ensure the pg_vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;

-- Table to store source documents and their high-level metadata
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT NOT NULL UNIQUE,       -- e.g., Confluence URL, Git repo path
    document_type VARCHAR(50) NOT NULL,    -- 'api_docs', 'tutorial', 'architecture_decision_record'
    last_modified TIMESTAMPTZ DEFAULT NOW(),
    metadata JSONB                         -- High-level metadata: author, team, version tags
);

-- Table to store the individual chunks and their embeddings
CREATE TABLE chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_content TEXT NOT NULL,
    -- Assuming OpenAI Ada-002 embeddings (1536 dimensions);
    -- choose the dimension based on your embedding model
    embedding VECTOR(1536) NOT NULL,
    -- Fine-grained metadata specific to this chunk
    chunk_metadata JSONB,                  -- e.g., {'section_title': 'Authentication', 'start_char': 512, 'chunk_level': 2}
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create an index on document_id for efficient joins
CREATE INDEX idx_chunks_document_id ON chunks(document_id);

-- Create GIN indexes on the metadata fields for fast JSONB lookups
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
CREATE INDEX idx_chunks_chunk_metadata ON chunks USING GIN (chunk_metadata);
```
**Schema Rationale:**
* **`documents` Table:** Acts as the canonical source of truth. Storing high-level metadata here (like `version`) prevents data duplication and allows for efficient filtering across all chunks of a document.
* **`chunks` Table:** This is our workhorse. It stores the text, the vector, and a crucial `chunk_metadata` JSONB field.
* **`document_id` Foreign Key:** This relational link is the superpower of using Postgres. It allows us to join chunk-level vector search results with document-level metadata filters.
* **`JSONB` for Metadata:** Provides the flexibility to store arbitrary structured data. We can filter on keys within this JSON, such as `version`, `section_title`, or `tags`.
* **GIN Indexes:** These are critical for making queries on `JSONB` fields performant. Without them, filtering on metadata would require a slow sequential scan. A filter shape these indexes can serve is sketched after this list.
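Below is a minimal sketch of such a metadata-only filter, assuming the `psycopg2` driver; the connection string is illustrative and the metadata keys (`team`, `section_title`) are taken from the schema comments above. Note that the JSONB containment operator `@>` is one of the operators a GIN index on `jsonb` can serve.

```python
import psycopg2  # assumed driver; any PostgreSQL client works the same way

conn = psycopg2.connect("dbname=rag_db")  # connection string is illustrative
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT c.id, c.chunk_content
        FROM chunks c
        JOIN documents d ON c.document_id = d.id
        WHERE d.metadata @> %s::jsonb          -- document-level filter, GIN-indexable
          AND c.chunk_metadata @> %s::jsonb    -- chunk-level filter, GIN-indexable
        """,
        ('{"team": "platform"}', '{"section_title": "Authentication"}'),
    )
    rows = cur.fetchall()
```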
---
## Part 4: Implementing High-Performance Hybrid Search
Now we combine our schema and chunking strategies. Pure vector search is not enough. Consider the query: *"What changed in the v2.5 API regarding billing?"*
A pure vector search might pull semantically similar content from v1.0 or v3.0 of the API docs, leading to incorrect answers. We must filter *before* we search.
### Pre-filtering vs. Post-filtering
* **Post-filtering (Bad):** Fetch the top `k` (e.g., 100) nearest neighbors based on vector similarity *first*, then apply the `WHERE` clause to filter that small result set. This is inefficient and often inaccurate, as the *truly* relevant (but slightly less similar) chunks might not even be in the initial top `k` results.
* **Pre-filtering (Good):** Apply the metadata `WHERE` clause *first* to narrow down the search space, and then perform the vector search only on the filtered subset. This is far more efficient and accurate.
With `pg_vector` and our schema, we can write a single, powerful SQL query that performs pre-filtering.
### The Hybrid Search SQL Query
Let's assume we have a placeholder function `get_embedding_vector(:query_text)` that returns the vector for our query.
```sql
-- Our user's query
-- :query_text = 'What changed in the v2.5 API regarding billing?'
-- :version_filter = '2.5%'
-- :doc_type_filter = 'api_docs'
-- :limit = 10

SELECT
    c.chunk_content,
    c.chunk_metadata,
    d.source_url,
    -- The <=> operator calculates cosine distance (0 = identical, 2 = opposite)
    c.embedding <=> get_embedding_vector(:query_text) AS distance
FROM
    chunks c
JOIN
    documents d ON c.document_id = d.id
WHERE
    -- Metadata filtering happens here, BEFORE the vector search
    d.document_type = :doc_type_filter AND
    (d.metadata->>'version')::text LIKE :version_filter
ORDER BY
    distance
LIMIT
    :limit;
```
**Dissecting the Query:**
1. **`JOIN documents d ON c.document_id = d.id`:** We link chunks to their parent documents.
2. **`WHERE d.document_type = ... AND (d.metadata->>'version')::text LIKE ...`:** This is the pre-filtering magic. The planner restricts the candidate set to only the documents that match our criteria (`api_docs` of version `2.5.x`) before distances are ranked. To keep this filter index-assisted, add a B-tree index on `document_type` and, if you filter extracted JSONB keys with `LIKE`, an expression index on `(metadata->>'version')`; the GIN indexes accelerate JSONB containment operators such as `@>`.
3. **`ORDER BY c.embedding <=> ...`:** The vector similarity search is performed *only on the chunks belonging to the filtered documents*. The `<=>` operator is the cosine distance operator from `pg_vector`.
4. **`LIMIT 10`:** We retrieve the top 10 most relevant chunks from the filtered set.
This single query is vastly superior to a multi-step process. It's atomic, efficient, and leverages decades of query optimization from PostgreSQL.
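In application code there is no `get_embedding_vector()` SQL function; you compute the query embedding client-side and pass it in as a parameter. A minimal sketch follows, assuming `psycopg2` and a `sentence-transformers` model. The model name is an assumption and must match the model used at ingestion, as must the `VECTOR(...)` dimension in the schema (the 384-dimensional MiniLM model here implies a `VECTOR(384)` column rather than the 1536 shown earlier).

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# Assumed embedding model; must match what produced the stored embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim

def hybrid_search(conn, query_text: str, doc_type: str, version_like: str, limit: int = 10):
    # Compute the query embedding client-side and serialize it as a pg_vector literal
    query_vec = model.encode(query_text).tolist()
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT c.chunk_content, c.chunk_metadata, d.source_url,
                   c.embedding <=> %s::vector AS distance
            FROM chunks c
            JOIN documents d ON c.document_id = d.id
            WHERE d.document_type = %s
              AND (d.metadata->>'version') LIKE %s
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, doc_type, version_like, limit),
        )
        return cur.fetchall()

# Example call (connection setup omitted):
# results = hybrid_search(conn, "What changed in the v2.5 API regarding billing?",
#                         "api_docs", "2.5%")
```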
---
## Part 5: Production Performance, Indexing, and Edge Cases
Writing the query is one thing; making it run in milliseconds on millions of vectors is another. This is where indexing becomes non-negotiable.
### Critical `pg_vector` Indexing
`pg_vector` supports Approximate Nearest Neighbor (ANN) search through indexes, which trades perfect accuracy for massive speed improvements. The two main types are IVFFlat and HNSW.
* **HNSW (Hierarchical Navigable Small World):** Generally the recommended choice. It offers excellent query performance with high recall and is robust to data distribution. It has two key parameters:
* `m`: The maximum number of connections per layer (default 16).
* `ef_construction`: The size of the dynamic candidate list during index build (default 64).
**Creating the HNSW Index:**
```sql
-- This can take a long time on a large table!
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```
**Tuning HNSW:**
* Higher `m` and `ef_construction` values lead to a more accurate (higher recall) index that takes longer to build and uses more memory.
* For production, start with the defaults and benchmark. A good starting point for tuning might be `m = 24` and `ef_construction = 128`.
* You can also tune search-time accuracy vs. speed with the `hnsw.ef_search` parameter: `SET hnsw.ef_search = 100;` before running your query. Both knobs appear together in the sketch after this list.
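Putting the two together, a tuned index build plus a session-level recall setting might look like the following sketch. The index name and connection string are illustrative; `CONCURRENTLY` avoids blocking writes during the long build but cannot run inside a transaction, hence the autocommit setting.

```python
import psycopg2

conn = psycopg2.connect("dbname=rag_db")  # connection string is illustrative
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    # Build the HNSW index with the suggested tuning starting points
    cur.execute(
        """
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 24, ef_construction = 128);
        """
    )
    # Session-level trade-off between recall and latency at query time
    cur.execute("SET hnsw.ef_search = 100;")
```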
### Analyzing the Query Plan with `EXPLAIN ANALYZE`
Never guess about performance. Use `EXPLAIN ANALYZE` to see exactly how PostgreSQL is executing your query.
**Without a vector index:**
```
-> Limit  (cost=... rows=10 width=...)
   -> Sort  (cost=... rows=... width=...)
        Sort Key: ((embedding <=> ...))
        -> Nested Loop
             -> Index Scan using ... on documents
             -> Seq Scan on chunks
```
Notice the `Seq Scan on chunks`. The database is brute-force calculating the distance for every single chunk that passes the metadata filter. This will be incredibly slow.
**With an HNSW vector index:**
```
-> Limit  (cost=... rows=10 width=...)
   -> Nested Loop
        -> Index Scan using <hnsw index> on chunks
             Order By: (embedding <=> ...)
        -> Index Scan using ... on documents
```
Now, you see an `Index Scan` using your HNSW index. This is an ANN search, which will be orders of magnitude faster.
### Edge Case: The "Lost in the Middle" Problem
ANN search is great for finding the top `k` results, but what if a key piece of information is buried in the middle of a long document? It might be semantically relevant but not score high enough on cosine similarity to make it into your top 10. The solution is **re-ranking**.
1. **Retrieve Broadly:** Fetch a larger number of candidates from your database, e.g., `k=50`.
2. **Re-rank with a Cross-Encoder:** Use a more powerful, computationally expensive cross-encoder model (such as Cohere Rerank or a fine-tuned BERT cross-encoder). Unlike a bi-encoder (which creates embeddings separately), a cross-encoder takes the query and a document chunk *together* as input and outputs a single relevance score. This allows it to capture much finer-grained nuances.
**Python Re-ranking Logic:**
```python
from sentence_transformers import CrossEncoder

query = "What changed in the v2.5 API regarding billing?"

# 1. Fetch a broad set of candidates (top 50) from Postgres using the hybrid query
#    (execute_hybrid_search_query is a placeholder for your own data-access function)
initial_candidates = execute_hybrid_search_query(query, limit=50)

# 2. Re-rank with a cross-encoder, which scores each (query, chunk) pair jointly
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [[query, candidate['chunk_content']] for candidate in initial_candidates]
scores = cross_encoder.predict(pairs)

# Combine candidates with their new scores and sort by relevance
for i, candidate in enumerate(initial_candidates):
    candidate['rerank_score'] = scores[i]

reranked_results = sorted(initial_candidates, key=lambda x: x['rerank_score'], reverse=True)

# 3. Use the top 5 from the re-ranked list as the final context for the LLM
final_context = reranked_results[:5]
```