Advanced RAG: Graph Indexing for Complex Document Hierarchies
The Contextual Void: Why Flat Vector Search Fails in Complex RAG
Retrieval-Augmented Generation (RAG) has become a cornerstone of modern LLM applications. The standard pipeline—chunk, embed, store, retrieve, augment—is powerful for question-answering over a corpus of independent documents. However, this paradigm reveals significant weaknesses when applied to real-world, interconnected knowledge bases like corporate wikis, software documentation, legal case files, or research papers. These documents are not islands; they exist in a rich web of explicit and implicit relationships.
A standard vector search RAG system, when asked, "How does the `AuthService`'s rate-limiting policy affect the `BillingAPI`?", might retrieve a chunk from the `AuthService` documentation mentioning rate-limiting and another from the `BillingAPI` documentation. It then forces the LLM to bridge the contextual gap, often leading to hallucinations or incomplete answers because the explicit link between the two services (e.g., an API call dependency defined in an OpenAPI spec) was never retrieved.
This is the contextual void. Flat vector search treats every chunk as an independent peer whose only relationship to other chunks is semantic similarity. It's blind to:
* Hierarchy: A code comment is part of a function, which is in a file, which belongs to a module. This hierarchy is critical context.
* Explicit References: Document A explicitly cites Document B.
* Shared Entities: Two seemingly unrelated documents might both discuss a specific internal project, `Project Titan`, creating a critical thematic link.
* Causality and Dependency: A system design document outlines a dependency that is later implemented in a specific code module.
To build truly intelligent RAG systems, we must move beyond semantic similarity and model the structure of knowledge. This is where Graph-RAG comes in. By representing our corpus as a knowledge graph, we can perform sophisticated, multi-hop retrievals that capture both semantic similarity and explicit relationships, providing the LLM with a far richer, more accurate context.
This article details a production-ready approach to building a Graph-RAG system using Neo4j for graph storage, LLMs for intelligent entity/relationship extraction, and a hybrid retrieval strategy that combines the best of vector search and graph traversal.
The Graph-RAG Paradigm: Modeling Knowledge as a Connected Web
Instead of a flat list of vectors, we model our corpus as a labeled property graph. This structure consists of nodes (entities) and relationships (edges) that connect them.
Our Core Node Types:
* `Document`: Represents a whole document (e.g., a Markdown file, a PDF, a web page). Properties: `id`, `source_url`, `title`.
* `Chunk`: A text segment from a `Document`. This is the node that will hold the vector embedding. Properties: `id`, `text`, `embedding`, `start_char`, `end_char`.
* `Entity`: A named entity extracted from the text (e.g., a person, a software component, a legal term). Properties: `id`, `name`, `type`.
Our Core Relationship Types:
* `HAS_CHUNK`: Connects a `Document` to its constituent `Chunk` nodes.
* `MENTIONS`: Connects a `Chunk` to an `Entity` it discusses.
* `REFERENCES`: Connects one `Document` to another (e.g., via a hyperlink or citation).
* `PART_OF`: Creates hierarchical relationships (e.g., a `Chunk` representing a function is `PART_OF` a `Chunk` representing a file).
This model allows us to answer complex queries. The question about `AuthService` and `BillingAPI` is no longer a simple vector search. It becomes a graph traversal problem: "Find chunks mentioning `AuthService` and rate-limiting, then explore their connected nodes to see if any paths lead to nodes related to `BillingAPI`."
Implementation Deep Dive: Building the Knowledge Graph
Let's build this system. Our stack will be Python, the `neo4j` driver, OpenAI's API for extraction and generation, and a sentence-transformer model for embeddings.
Step 1: Advanced Entity and Relationship Extraction with an LLM
This is the most critical step and where many projects fail. Simply extracting named entities is not enough. We need to extract relationships between them. We'll use an LLM with a carefully engineered prompt to act as a zero-shot information extractor.
The Prompt Engineering:
The key is to constrain the LLM's output to a predictable JSON schema. We provide the schema and instruct the model to populate it based on the text.
import openai
import json
# Ensure you have your OPENAI_API_KEY set in your environment
EXTRACTION_SYSTEM_PROMPT = """
You are an expert data analyst. Your task is to extract entities and their relationships from the provided text.
Extract the following information:
- Documents: The main subjects of the text.
- Entities: Key concepts, components, persons, or technologies mentioned.
- Relationships: Connections between the extracted documents and entities.
Respond ONLY with a valid JSON object in the following schema:
{
"documents": [{"id": "<document_name>", "type": "<document_type>"}],
"entities": [{"id": "<entity_name>", "type": "<entity_type>"}],
"relationships": [{"source": "<source_id>", "target": "<target_id>", "type": "<relationship_type>"}]
}
Possible entity types: 'Service', 'API', 'Policy', 'Project', 'Person'.
Possible relationship types: 'USES', 'AFFECTS', 'DEFINES', 'PART_OF', 'MENTIONS'.
Analyze the text and populate the schema. Ensure all `source` and `target` IDs in relationships match an ID from the documents or entities lists.
"""
def extract_graph_from_document(text: str) -> dict:
    """Uses an LLM to extract a graph structure from a text document."""
    try:
        response = openai.chat.completions.create(
            model="gpt-4-1106-preview",  # Or your preferred model
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
                {"role": "user", "content": f"Here is the text to analyze:\n\n{text}"}
            ]
        )
        return json.loads(response.choices[0].message.content)
    except Exception as e:
        print(f"Error during graph extraction: {e}")
        return {"documents": [], "entities": [], "relationships": []}
# Example Usage
document_text = """
Title: AuthService Architecture
The AuthService is a core component of Project Titan. It defines the primary rate-limiting policy for all internal traffic. This policy directly affects the BillingAPI, which uses the AuthService for user authentication. The main logic is implemented in `auth.py`.
"""
graph_data = extract_graph_from_document(document_text)
print(json.dumps(graph_data, indent=2))
This LLM-based approach is powerful but has production implications. It's slower and more expensive than traditional NLP methods. For large-scale ingestion, this should be run as an asynchronous batch process. You might also consider fine-tuning a smaller, open-source model for this specific JSON extraction task to reduce costs and improve latency.
Step 2: Intelligent Chunking and Embedding
Simple fixed-size chunking is suboptimal. A chunk might end mid-sentence, destroying its semantic meaning. We'll use a `Document` -> `Chunk` hierarchy.
First, we'll create a `Document` node. Then, we'll chunk the text. A simple strategy is paragraph-based chunking; for code, it could be function-based. The key is that each `Chunk` node is linked to its parent `Document`.
from sentence_transformers import SentenceTransformer
# Use a high-quality, pre-trained model
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
def create_document_chunks(doc_id: str, doc_text: str, chunk_size=512, overlap=50):
    # A more robust implementation would use semantic chunking or paragraph splitting
    chunks = []
    for i in range(0, len(doc_text), chunk_size - overlap):
        chunk_text = doc_text[i:i + chunk_size]
        chunks.append({
            "document_id": doc_id,
            "text": chunk_text,
            "embedding": embedding_model.encode(chunk_text).tolist()
        })
    return chunks
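Since the comment above points to paragraph splitting as a more robust option, here is a minimal sketch of paragraph-based chunking under the same interface; the `max_chars` budget and the helper name are illustrative choices:
def create_paragraph_chunks(doc_id: str, doc_text: str, max_chars: int = 1000):
    """Paragraph-based chunking: split on blank lines, then pack paragraphs
    into chunks of at most roughly max_chars characters."""
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return [{
        "document_id": doc_id,
        "text": text,
        "embedding": embedding_model.encode(text).tolist()
    } for text in chunks]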
Step 3: Populating the Neo4j Knowledge Graph
With our extracted graph data and chunks, we can now populate Neo4j. We will use idempotent `MERGE` queries to avoid creating duplicate nodes.
from neo4j import GraphDatabase
class KnowledgeGraph:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def ingest_data(self, graph_data: dict, chunks: list):
        with self.driver.session() as session:
            session.execute_write(self._ingest_graph, graph_data, chunks)

    @staticmethod
    def _ingest_graph(tx, graph_data, chunks):
        # Ingest Documents and Entities
        for doc in graph_data.get('documents', []):
            tx.run("MERGE (d:Document {id: $id}) SET d.type = $type", id=doc['id'], type=doc.get('type'))
        for entity in graph_data.get('entities', []):
            tx.run("MERGE (e:Entity {id: $id}) SET e.type = $type", id=entity['id'], type=entity.get('type'))
        # Ingest Chunks and link to parent Document
        for i, chunk in enumerate(chunks):
            chunk_id = f"{chunk['document_id']}-chunk-{i}"
            tx.run("""
                MATCH (d:Document {id: $doc_id})
                MERGE (c:Chunk {id: $chunk_id})
                SET c.text = $text, c.embedding = $embedding
                MERGE (d)-[:HAS_CHUNK]->(c)
            """, doc_id=chunk['document_id'], chunk_id=chunk_id, text=chunk['text'], embedding=chunk['embedding'])
        # Ingest Relationships. The relationship type cannot be passed as a Cypher parameter,
        # so it is interpolated; validate it against the allowed types from the extraction prompt.
        for rel in graph_data.get('relationships', []):
            tx.run("""
                MATCH (source) WHERE source.id = $source_id
                MATCH (target) WHERE target.id = $target_id
                MERGE (source)-[r:%s]->(target)
            """ % rel['type'], source_id=rel['source'], target_id=rel['target'])
# Usage:
# kg = KnowledgeGraph("neo4j://localhost:7687", "neo4j", "password")
# # Assuming 'graph_data' from LLM and 'chunks' from chunking process
# document_id = graph_data['documents'][0]['id']
# chunks = create_document_chunks(document_id, document_text)
# kg.ingest_data(graph_data, chunks)
# kg.close()
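One gap to be aware of: the retrieval query later in this article expects `(:Chunk)-[:MENTIONS]->(:Entity)` relationships, while the LLM extraction above emits relationships between documents and entities. A simple, assumption-laden way to close that gap is to link each chunk to any extracted entity whose name appears in its text (naive substring matching; a production system might instead run the extractor or an NER pass per chunk). The helper name is illustrative:
def link_chunks_to_entities(driver, chunks: list, entities: list):
    """Create (:Chunk)-[:MENTIONS]->(:Entity) edges by naive name matching."""
    with driver.session() as session:
        for i, chunk in enumerate(chunks):
            # Must match the chunk_id format used during ingestion
            chunk_id = f"{chunk['document_id']}-chunk-{i}"
            for entity in entities:
                if entity['id'].lower() in chunk['text'].lower():
                    session.run("""
                        MATCH (c:Chunk {id: $chunk_id})
                        MATCH (e:Entity {id: $entity_id})
                        MERGE (c)-[:MENTIONS]->(e)
                    """, chunk_id=chunk_id, entity_id=entity['id'])

# Example:
# link_chunks_to_entities(kg.driver, chunks, graph_data.get('entities', []))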
Step 4: Creating a Vector Index in Neo4j
To combine graph traversal with vector search, we need a vector index on our `Chunk` nodes. This is a crucial step for our hybrid retrieval strategy.
Execute this Cypher query directly in Neo4j Browser or via the driver:
CREATE VECTOR INDEX `chunk_embeddings` IF NOT EXISTS
FOR (c:Chunk)
ON (c.embedding)
OPTIONS {indexConfig: {
`vector.dimensions`: 768, // Must match your embedding model's dimensions
`vector.similarity_function`: 'cosine'
}}
This command tells Neo4j to build and maintain an efficient index for performing similarity searches on the `embedding` property of all `Chunk` nodes.
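If you prefer to manage the index from your ingestion code rather than the Neo4j Browser, the same statement can be issued through the driver. Here is a small sketch that reads the dimension from the loaded sentence-transformer so the index and the embeddings stay in sync; the helper name is illustrative:
def ensure_chunk_vector_index(driver, embedding_model):
    """Create the chunk_embeddings vector index if it does not already exist."""
    dimensions = embedding_model.get_sentence_embedding_dimension()  # 768 for all-mpnet-base-v2
    query = f"""
        CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
        FOR (c:Chunk)
        ON (c.embedding)
        OPTIONS {{indexConfig: {{
            `vector.dimensions`: {dimensions},
            `vector.similarity_function`: 'cosine'
        }}}}
    """
    with driver.session() as session:
        session.run(query)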
Advanced Retrieval: Multi-Hop Hybrid Queries
This is where the Graph-RAG approach truly shines. Our retrieval process is no longer a single API call but a multi-step workflow.
1. Vector Search: Use the vector index to find the `Chunk` nodes most semantically similar to the user's query.
2. Graph Traversal: Expand from those chunks along the graph's relationships to collect surrounding chunks and entity-linked context.
3. Augment and Generate: Synthesize the expanded context and pass it to the LLM to generate the answer.
Here's how to implement it:
class GraphRAGQueryEngine:
    def __init__(self, driver, embedding_model):
        self.driver = driver
        self.embedding_model = embedding_model

    def query(self, query_text: str):
        query_vector = self.embedding_model.encode(query_text).tolist()
        with self.driver.session() as session:
            # Step 1 & 2: Hybrid vector search and graph traversal in one query
            result = session.run("""
                CALL db.index.vector.queryNodes('chunk_embeddings', 5, $query_vector) YIELD node AS similar_chunk, score
                // Find the parent document of the similar chunk
                MATCH (d:Document)-[:HAS_CHUNK]->(similar_chunk)
                // Expand context: Find other chunks from the same document
                OPTIONAL MATCH (similar_chunk)<-[:HAS_CHUNK]-(d)-[:HAS_CHUNK]->(other_chunk)
                WHERE other_chunk <> similar_chunk
                // Expand context: Find entities mentioned in the similar chunk and documents they appear in elsewhere
                OPTIONAL MATCH (similar_chunk)-[:MENTIONS]->(e:Entity)<-[:MENTIONS]-(related_chunk:Chunk)
                WHERE related_chunk <> similar_chunk
                WITH similar_chunk, score, d,
                     collect(DISTINCT other_chunk.text) AS other_chunks_in_doc,
                     collect(DISTINCT {entity: e.id, related_chunk: related_chunk.text}) AS related_info
                RETURN
                    similar_chunk.text AS text,
                    score,
                    d.id AS document_id,
                    other_chunks_in_doc,
                    related_info
                ORDER BY score DESC
            """, query_vector=query_vector)
            # The result cursor must be consumed while the session is still open
            context = self._synthesize_context(result)
        # Step 3: Augment and Generate
        final_answer = self._generate_response(query_text, context)
        return final_answer

    def _synthesize_context(self, result) -> str:
        context_str = ""
        for record in result:
            context_str += f"\n---\nSource Document: {record['document_id']} (Similarity Score: {record['score']:.4f})\n"
            context_str += f"Retrieved Chunk: {record['text']}\n"
            if record['other_chunks_in_doc']:
                context_str += "\nOther relevant chunks from the same document:\n"
                for text in record['other_chunks_in_doc'][:2]:  # Limit for brevity
                    context_str += f"- {text}\n"
            if record['related_info']:
                context_str += "\nRelated information from other documents via shared entities:\n"
                for info in record['related_info'][:2]:
                    if info['entity'] and info['related_chunk']:
                        context_str += f"- Entity '{info['entity']}' is also mentioned in: '{info['related_chunk']}'\n"
        return context_str

    def _generate_response(self, query, context):
        prompt = f"""
You are a helpful AI assistant. Answer the user's question based on the following context retrieved from a knowledge graph.
Be concise and precise. If the context does not contain the answer, say so.
Context:
{context}
Question: {query}
Answer:
"""
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an expert Q&A system."},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
# Example Usage:
# driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
# model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# engine = GraphRAGQueryEngine(driver, model)
# answer = engine.query("How does the AuthService rate-limiting affect the BillingAPI?")
# print(answer)
# driver.close()
The Cypher query above is the heart of our retrieval:
* `CALL db.index.vector.queryNodes`: Performs the initial vector search to get the top 5 most similar chunks.
* `MATCH (d:Document)-[:HAS_CHUNK]->(similar_chunk)`: Finds the parent document for each retrieved chunk.
* `OPTIONAL MATCH ... (other_chunk)`: Traverses back to the document and then out to its other chunks, gathering local context.
* `OPTIONAL MATCH ... (related_chunk)`: Performs a multi-hop traversal. It finds entities mentioned in the primary chunk, then finds other chunks anywhere in the graph that also mention those same entities. This is how we bridge context across disparate documents.
* `collect(DISTINCT ...)`: Aggregates the expanded context for each initial hit.
This single, powerful query provides a rich, structured context far superior to a simple list of semantically similar chunks.
Production Considerations, Edge Cases, and Performance
Deploying a Graph-RAG system requires careful planning.
1. Scalability of Ingestion:
* Problem: The LLM-based extraction is a bottleneck. Running it on millions of documents is slow and expensive.
* Solution:
* Batch Processing: Use a job queue (e.g., Celery, RabbitMQ) to process documents asynchronously (see the sketch after this list).
* Fine-tuning: Fine-tune a smaller, open-source model (like a member of the Llama or Mistral families) on a high-quality dataset of `(text, graph_json)` pairs generated by a more powerful model like GPT-4. This drastically reduces per-document cost and latency.
* Parallelization: Use tools like Ray or Spark to distribute the extraction and embedding process across multiple workers.
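As a rough illustration of the batch-processing idea (independent of any particular queue), the sketch below fans `extract_graph_from_document` out over a thread pool; a real pipeline would add retries, rate-limit handling, and persistence of intermediate results. The function name is illustrative:
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_graphs_in_bulk(documents: dict, max_workers: int = 8) -> dict:
    """Run LLM extraction over many documents concurrently.
    `documents` maps a document id to its raw text."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_graph_from_document, text): doc_id
                   for doc_id, text in documents.items()}
        for future in as_completed(futures):
            doc_id = futures[future]
            results[doc_id] = future.result()  # extraction errors already return an empty graph
    return results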
2. Graph Maintenance and Updates:
* Problem: Documents get updated or deleted. How do you keep the graph consistent?
* Solution:
* Versioning: Add a `version` or `last_updated` property to `Document` nodes. Your ingestion pipeline should check this before processing.
* Stale Component Deletion: For updates, detach and delete old chunks and relationships associated with a `Document` before ingesting the new version. This can be done in a single transaction to maintain consistency (see the sketch after this list).
* TTL (Time-To-Live): For transient data, consider using APOC's TTL feature to automatically expire old nodes.
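Here is a minimal sketch of the stale-component deletion step, reusing `KnowledgeGraph._ingest_graph` from Step 3. It deletes the document's old chunks and re-ingests the new version in one transaction; entity nodes are left in place because other documents may still reference them. The helper name is illustrative:
def replace_document(driver, doc_id: str, new_graph_data: dict, new_chunks: list):
    """Atomically remove a document's old chunks, then re-ingest the new version."""
    with driver.session() as session:
        def _replace(tx):
            # Detach-delete the old chunks; the Document node itself is kept
            tx.run("""
                MATCH (d:Document {id: $doc_id})-[:HAS_CHUNK]->(c:Chunk)
                DETACH DELETE c
            """, doc_id=doc_id)
            # Re-ingest within the same transaction for consistency
            KnowledgeGraph._ingest_graph(tx, new_graph_data, new_chunks)
        session.execute_write(_replace)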
3. Query Performance Optimization:
* Problem: Complex Cypher queries can be slow on large graphs.
* Solution:
* Profiling: Prefix your Cypher queries with `PROFILE` or `EXPLAIN` to understand their execution plan. Look for full graph scans (`NodeByLabelScan`) and high database hits.
* Schema Indexes: Create traditional indexes on node properties used in `MATCH` clauses, like `id` and `type`. `CREATE INDEX FOR (d:Document) ON (d.id)` and `CREATE INDEX FOR (e:Entity) ON (e.id)` are essential.
* Query Parameterization: Always use parameters (like `$query_vector`) instead of string formatting. This allows Neo4j to cache the query plan for much faster subsequent executions.
* Limiting Traversal Depth: In highly connected graphs, unbounded traversals can explode. Use variable-length path limits, e.g., `MATCH (a)-[*1..3]-(b)`, to cap traversals at 3 hops.
4. Handling Low-Quality Initial Retrieval:
* Problem: What if the initial vector search returns irrelevant chunks? The subsequent graph traversal will explore the wrong part of the graph.
* Solution:
* Hybrid Search Fallback: Combine vector search with keyword-based full-text search. Neo4j supports full-text indexes. You can run both searches and combine the results.
* Re-ranking: Retrieve more initial candidates than you need (e.g., top 20) and then use a more sophisticated re-ranking model (like a cross-encoder, sketched after this list) or business logic to select the best starting points for graph traversal.
* Query Expansion: Use an LLM to rewrite the user's query into several variations or to extract key entities from the query itself, which can then be used to directly look up nodes in the graph, bypassing the initial vector search entirely for certain query types.
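As an illustration of the re-ranking idea, here is a small sketch using a cross-encoder from the `sentence-transformers` library; the model name, candidate count, and helper name are illustrative choices rather than requirements:
from sentence_transformers import CrossEncoder

# Illustrative model choice; any cross-encoder trained for passage ranking works
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_chunks(query: str, candidate_chunks: list, top_k: int = 5) -> list:
    """Score (query, chunk_text) pairs with a cross-encoder and keep the best top_k.
    `candidate_chunks` are dicts with a 'text' key, e.g. the top 20 vector-search hits."""
    pairs = [(query, chunk['text']) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]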
Conclusion: The Next Frontier of RAG
By moving from flat vector stores to structured knowledge graphs, we fundamentally upgrade the capabilities of our RAG systems. This Graph-RAG approach transforms retrieval from a simple similarity search into a sophisticated reasoning process over the relationships inherent in our data.
The implementation is more complex than a standard RAG pipeline, but the payoff is a system that can answer nuanced, multi-part questions that are impossible for vector-only systems. It provides more accurate, explainable, and contextually aware responses, pushing the boundaries of what's possible with LLM-powered applications. The future of RAG is not just about finding similar text; it's about understanding the connections between documents.