Stateful RAG: Multi-Turn Conversation with Graph Databases
The Achilles' Heel of Standard RAG: Conversational Amnesia
Retrieval-Augmented Generation (RAG) has become the de facto standard for grounding Large Language Models (LLMs) in factual, external knowledge. The canonical implementation is brutally effective yet fundamentally stateless: a user query is converted to an embedding, a vector database retrieves semantically similar document chunks, and these chunks are prepended to the prompt as context for the LLM. This works beautifully for single-shot, atomic questions.
However, the moment a user asks a follow-up question—"What about its performance in Europe?"—the stateless model collapses. The pronoun "its" is meaningless without the context of the previous turn. The standard solution, a sliding window of chat history, is a crude patch. It crams unstructured text into the context window, consuming precious tokens, failing to distinguish key entities from conversational filler, and offering no mechanism for long-term memory or recalling information from much earlier in the conversation.
To build sophisticated conversational agents, we must solve the problem of conversational state. This requires a system that not only remembers what was said but understands the relationships between the entities discussed. This is not a job for a key-value store or a relational database; it is a quintessential graph problem. This article details a production-grade architecture for building a stateful, multi-turn RAG system using a graph database (Neo4j) to create a dynamic, persistent model of the conversation's context.
The Graph-Based Context Model: From Chat Log to Knowledge Graph
Instead of viewing a conversation as a linear sequence of messages, we will model it as an evolving graph of interconnected entities and concepts. This paradigm shift is the core of our stateful architecture.
Why a Graph Database?
Graph databases like Neo4j are purpose-built to store and query highly connected data. In our conversational context, this means:
* Entities are first-class citizens: Key concepts mentioned in the conversation are stored as Entity nodes. This gives them a distinct, addressable identity.
* Relationships carry the meaning: A Query node can MENTION an Entity node, and two Entity nodes can be RELATED_TO each other.
* Traversal is native: A relational database would need expensive JOIN operations to approximate this, while a graph database handles it natively and efficiently.
Designing the Conversational Graph Schema
A robust schema is critical. We will define a set of node labels and relationship types to structure our conversational memory:
Node Labels:
* Session: Represents a single, continuous conversation with a user. It acts as the entry point for all queries related to that conversation.
* Turn: Represents a single user query and the LLM's response. It's a container for a request/response pair.
* Query: The user's specific utterance.
* Response: The LLM's generated answer.
* Entity: A named entity (e.g., 'Kubernetes', 'Project Titan', 'Q4-2023-report.pdf') extracted from queries or responses.
* DocumentChunk: A node representing a specific chunk of text from our knowledge base, linked to its vector embedding ID.
Relationship Types:
* HAS_TURN: (Session)-[:HAS_TURN]->(Turn)
* NEXT_TURN: (Turn)-[:NEXT_TURN]->(Turn) (Creates a temporal chain)
* HAS_QUERY: (Turn)-[:HAS_QUERY]->(Query)
* HAS_RESPONSE: (Turn)-[:HAS_RESPONSE]->(Response)
* MENTIONS: (Query)-[:MENTIONS]->(Entity), (Response)-[:MENTIONS]->(Entity)
* SOURCED_FROM: (Response)-[:SOURCED_FROM]->(DocumentChunk)
* RELATED_TO: (Entity)-[:RELATED_TO]->(Entity) (Can be populated offline or inferred during conversation)
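With the schema settled, it helps to declare the supporting constraint and indexes up front (the performance section later returns to why). Below is a minimal sketch using the official neo4j Python driver; the connection details and the Neo4j 5 index syntax are assumptions, and Entity(name) gets a plain index rather than a uniqueness constraint so that ambiguous names (see the disambiguation section) remain possible.
from neo4j import GraphDatabase

# Connection details are placeholders for your own deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SCHEMA_STATEMENTS = [
    # One Session node per conversation id
    "CREATE CONSTRAINT session_id IF NOT EXISTS FOR (s:Session) REQUIRE s.id IS UNIQUE",
    # Fast lookups for entity names and chunk text (plain indexes, since
    # several nodes may legitimately share a name, e.g. 'Apple')
    "CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name)",
    "CREATE INDEX chunk_text IF NOT EXISTS FOR (c:DocumentChunk) ON (c.text)",
]

def create_schema(driver):
    """Creates the constraint and indexes the conversational graph relies on."""
    with driver.session() as session:
        for statement in SCHEMA_STATEMENTS:
            session.run(statement)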
Here is a visualization of how a few turns of conversation might be represented:
graph TD
    subgraph Session_Graph["Conversation Session: sess_123"]
        S(Session) -- HAS_TURN --> T1(Turn 1)
        T1 -- NEXT_TURN --> T2(Turn 2)
        T1 -- HAS_QUERY --> Q1("Query: Tell me about Project X")
        T1 -- HAS_RESPONSE --> R1("Response: Project X is a... It uses Kubernetes.")
        T2 -- HAS_QUERY --> Q2("Query: How does it scale?")
        T2 -- HAS_RESPONSE --> R2("Response: It scales using HPA...")
    end
    subgraph Knowledge_Graph["Knowledge Graph"]
        E_ProjX("Entity: Project X")
        E_K8s("Entity: Kubernetes")
        E_HPA("Entity: HPA")
    end
    Q1 -- MENTIONS --> E_ProjX
    R1 -- MENTIONS --> E_ProjX
    R1 -- MENTIONS --> E_K8s
    %% Coreference resolution links "it" to Project X
    Q2 -- MENTIONS --> E_ProjX
    R2 -- MENTIONS --> E_HPA
    E_K8s -- RELATED_TO --> E_HPA
This model transforms a flat chat history into a rich, queryable knowledge graph specific to the conversation, enabling us to retrieve highly relevant, structured context.
The Stateful RAG Loop: A Five-Step Implementation
Let's walk through the full lifecycle of a single conversational turn in our stateful system. We'll use Python, the neo4j driver, and a hypothetical vector DB client.
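One prerequisite the five steps below take for granted: the Session node must exist before the first turn is written. A minimal bootstrap sketch (the UUID-based id scheme is an assumption):
import uuid

def start_session(driver) -> str:
    """Creates the Session node for a new conversation and returns its id."""
    session_id = str(uuid.uuid4())
    with driver.session() as neo_session:
        neo_session.run(
            "MERGE (s:Session {id: $session_id}) "
            "ON CREATE SET s.created_at = timestamp()",
            session_id=session_id,
        )
    return session_id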
Step 1: Ingestion and Coreference Resolution
When a user query arrives, the first step is to understand what it's about. This involves more than just embedding the raw string.
import spacy
# Load a transformer-based spaCy pipeline for named entity recognition
# (coreference itself is resolved via the LLM call below)
nlp = spacy.load("en_core_web_trf")
def process_user_query(session_id: str, query_text: str, graph_driver):
"""Extracts entities and resolves coreferences using graph context."""
# 1. Get recent entities from the graph for context
with graph_driver.session() as session:
result = session.run("""
MATCH (s:Session {id: $session_id})-[:HAS_TURN]->(t:Turn)
-[:HAS_QUERY|HAS_RESPONSE]->(m)-[:MENTIONS]->(e:Entity)
WITH e, max(t.timestamp) AS last_mentioned ORDER BY last_mentioned DESC
RETURN e.name AS entity_name
LIMIT 10
""", session_id=session_id)
recent_entities = [record["entity_name"] for record in result]
# This is a simplified example. A real implementation might use a more
# sophisticated prompt or model for coreference resolution.
# The key is feeding the model the list of candidate entities.
contextual_query = f"""
Recent entities mentioned: {', '.join(recent_entities)}.
Resolve pronouns in the following user query: '{query_text}'
"""
# In a real app, you would send this to an LLM to get a resolved query.
# For this example, let's assume a simple replacement.
# e.g., if recent_entities included 'Project Titan' and query was 'How does it scale?',
# the resolved query would be 'How does Project Titan scale?'
resolved_query_text = resolve_coreferences_with_llm(contextual_query) # Placeholder for LLM call
doc = nlp(resolved_query_text)
entities = [ent.text for ent in doc.ents]
return resolved_query_text, entities
# Placeholder for a function that would call an LLM
def resolve_coreferences_with_llm(prompt: str) -> str:
# In a real implementation, you'd use OpenAI, Anthropic, etc.
# to process the prompt and return the resolved query text.
# For now, we'll just return the original query part.
return prompt.split("user query: '")[1].split("'")[0]
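For completeness, here is one way that placeholder might be filled in, sketched against the OpenAI chat completions API; the model name and system instructions are illustrative assumptions, and any chat-capable LLM would work.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def resolve_coreferences_with_llm(prompt: str) -> str:
    """Asks an LLM to rewrite the query with pronouns resolved against recent entities."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Illustrative model choice
        messages=[
            {"role": "system", "content": "Rewrite the user query with all pronouns "
             "replaced by the entities they refer to. Return only the rewritten query."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()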
Step 2: Hybrid Context Retrieval (Graph + Vector)
Now we retrieve context using a two-pronged approach:
* Graph Retrieval (Explicit Context): We query the graph to find entities, documents, and past turns that are explicitly linked to the entities in our current query. This provides precise, structured context.
* Vector Retrieval (Semantic Context): We embed the resolved_query_text and perform a similarity search in our vector database. This catches related concepts that aren't directly linked in the graph.
Here's the Cypher query for graph retrieval. It starts from the current session, finds entities mentioned in the new query, and then traverses outwards to find related entities and the document chunks where they were originally mentioned.
def retrieve_graph_context(graph_driver, session_id: str, entities: list[str]):
cypher_query = """
// Find entities mentioned in the current query
MATCH (e:Entity) WHERE e.name IN $entities
WITH collect(e) as current_entities
// Find other entities related to the current ones
UNWIND current_entities AS e
OPTIONAL MATCH (e)-[:RELATED_TO]-(related_e:Entity)
WITH current_entities + collect(related_e) as all_relevant_entities
UNWIND all_relevant_entities as relevant_entity
// Find document chunks where these relevant entities were mentioned
// and prioritize chunks already surfaced in turns of this session.
MATCH (chunk:DocumentChunk)<-[:SOURCED_FROM]-(resp:Response)<-[:HAS_RESPONSE]-(t:Turn)
      <-[:HAS_TURN]-(s:Session {id: $session_id})
WHERE (resp)-[:MENTIONS]->(relevant_entity)
RETURN DISTINCT chunk.text as text, 1.0 AS score // Give high score to direct graph hits
LIMIT 5
UNION
// Fallback: find chunks related to the entities, not tied to this session
MATCH (e:Entity)<-[:MENTIONS]-(resp:Response)-[:SOURCED_FROM]->(chunk:DocumentChunk)
WHERE e.name IN $entities
RETURN DISTINCT chunk.text as text, 0.5 AS score
LIMIT 5
"""
with graph_driver.session() as session:
result = session.run(cypher_query, entities=entities, session_id=session_id)
return [record["text"] for record in result]
This query prioritizes information that has already been relevant in this specific conversation before falling back to general knowledge, much like human conversational recall favors recently discussed topics.
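The vector prong is left abstract in this article; a minimal sketch follows, assuming a sentence-transformers embedding model and a generic vector_db client whose query method returns matching chunks. Both the client and its interface are hypothetical stand-ins for whatever vector store you use (Pinecone, Weaviate, Qdrant, etc.).
from sentence_transformers import SentenceTransformer

# Loaded once at startup; the model choice is illustrative.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_vector_context(vector_db, query_text: str, top_k: int = 5) -> list[str]:
    """Embeds the resolved query and returns the top-k semantically similar chunks."""
    query_embedding = embedding_model.encode(query_text).tolist()
    # `vector_db.query` is a stand-in for your vector store's similarity search call.
    results = vector_db.query(vector=query_embedding, top_k=top_k)
    return [match["text"] for match in results]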
Step 3: Merging and Re-ranking Contexts
We now have two sets of context chunks: one from the graph and one from the vector DB. Simply concatenating them is suboptimal. A better approach is to use a re-ranking model (e.g., a cross-encoder) to score all retrieved chunks against the user's query and select the top-k most relevant results from the combined pool. This ensures the final context is dense with relevant information.
from sentence_transformers.cross_encoder import CrossEncoder
# This would be loaded once in your application
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank_contexts(query: str, contexts: list[str]):
pairs = [[query, ctx] for ctx in contexts]
scores = cross_encoder.predict(pairs)
scored_contexts = zip(scores, contexts)
scored_contexts = sorted(scored_contexts, key=lambda x: x[0], reverse=True)
# Return the top N contexts
return [ctx for score, ctx in scored_contexts[:5]]
# In our main flow:
graph_context = retrieve_graph_context(...)
vector_context = retrieve_vector_context(...) # Sketched at the end of Step 2
combined_context = list(set(graph_context + vector_context)) # De-duplicate
final_context = rerank_contexts(resolved_query_text, combined_context)
Step 4: Prompt Engineering and LLM Invocation
We construct a prompt that leverages our hard-won context. The structure should clearly delineate the different types of information.
SYSTEM: You are a helpful AI assistant with a persistent memory of this conversation.
CONVERSATIONAL CONTEXT GRAPH:
The user has previously discussed these key entities: ["Project Titan", "Kubernetes"]. The last response was about Horizontal Pod Autoscaling (HPA).
RETRIEVED KNOWLEDGE:
---
Context 1: [Text of the first re-ranked document chunk]
---
Context 2: [Text of the second re-ranked document chunk]
---
USER QUERY:
{resolved_query_text}
ASSISTANT RESPONSE:
This structured prompt helps the LLM differentiate between conversational history and retrieved documents, leading to more accurate and contextually-aware responses.
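Assembling that prompt is plain string templating. A sketch of one possible helper (the exact wording of the system message and section headers mirrors the template above; the function name is an assumption):
def build_prompt(resolved_query_text: str, recent_entities: list[str],
                 final_context: list[str]) -> str:
    """Assembles the structured prompt from graph state and re-ranked chunks."""
    context_blocks = "\n---\n".join(
        f"Context {i + 1}: {chunk}" for i, chunk in enumerate(final_context)
    )
    return (
        "SYSTEM: You are a helpful AI assistant with a persistent memory of this conversation.\n\n"
        "CONVERSATIONAL CONTEXT GRAPH:\n"
        f"The user has previously discussed these key entities: {recent_entities}.\n\n"
        "RETRIEVED KNOWLEDGE:\n---\n"
        f"{context_blocks}\n---\n\n"
        "USER QUERY:\n"
        f"{resolved_query_text}\n\n"
        "ASSISTANT RESPONSE:"
    )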
Step 5: Updating the Graph (Closing the Loop)
This final step is what makes the system stateful. After the LLM generates a response, we parse it and update our graph to incorporate the new information from this turn.
This must be done in a single atomic transaction to ensure data integrity.
def update_graph_with_turn(driver, session_id: str, query_text: str, query_entities: list, response_text: str, used_chunks: list):
# First, extract entities from the LLM's response
doc = nlp(response_text)
response_entities = [ent.text for ent in doc.ents]
all_entities = list(set(query_entities + response_entities))
cypher_transaction = """
    // Ensure all entities exist, then collapse back to a single row so the
    // following CREATE clauses run exactly once (not once per entity)
    UNWIND $all_entities AS entity_name
    MERGE (e:Entity {name: entity_name})
    WITH collect(e) AS ensured_entities
    // Find the session and its current last turn (the turn with no outgoing NEXT_TURN)
    MATCH (s:Session {id: $session_id})
    OPTIONAL MATCH (s)-[:HAS_TURN]->(last_turn:Turn)
    WHERE NOT (last_turn)-[:NEXT_TURN]->()
    // Create the new turn, query, and response nodes
    CREATE (t:Turn {timestamp: timestamp()})
    CREATE (q:Query {text: $query_text})
    CREATE (r:Response {text: $response_text})
    // Link them together
    CREATE (s)-[:HAS_TURN]->(t)
    CREATE (t)-[:HAS_QUERY]->(q)
    CREATE (t)-[:HAS_RESPONSE]->(r)
    // If there was a last turn, chain it to the new one
    FOREACH (lt IN CASE WHEN last_turn IS NULL THEN [] ELSE [last_turn] END |
        CREATE (lt)-[:NEXT_TURN]->(t)
    )
    // Link the query and response to their entities. FOREACH keeps the row count
    // stable and is simply a no-op for empty entity lists.
    FOREACH (entity_name IN $query_entities |
        MERGE (qe:Entity {name: entity_name})
        MERGE (q)-[:MENTIONS]->(qe)
    )
    FOREACH (entity_name IN $response_entities |
        MERGE (re:Entity {name: entity_name})
        MERGE (r)-[:MENTIONS]->(re)
    )
    // Link the response to the document chunks it was sourced from
    WITH r
    UNWIND $used_chunks AS chunk_text
    MATCH (c:DocumentChunk {text: chunk_text})
    MERGE (r)-[:SOURCED_FROM]->(c)
"""
with driver.session() as session:
session.execute_write(lambda tx: tx.run(cypher_transaction,
session_id=session_id,
query_text=query_text,
response_text=response_text,
query_entities=query_entities,
response_entities=response_entities,
all_entities=all_entities,
used_chunks=used_chunks
).consume())
This transactional update ensures that each turn of the conversation is atomically added to the graph, growing the conversational memory and making it available for subsequent turns.
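Tying the five steps together, a single pass through the loop might look like the sketch below. It reuses the functions defined above plus build_prompt from the Step 4 sketch; generate_response is a placeholder for whatever LLM call you use.
def handle_turn(driver, vector_db, session_id: str, user_query: str) -> str:
    # Step 1: resolve coreferences and extract entities
    resolved_query, entities = process_user_query(session_id, user_query, driver)

    # Step 2: hybrid retrieval (graph + vector)
    graph_context = retrieve_graph_context(driver, session_id, entities)
    vector_context = retrieve_vector_context(vector_db, resolved_query)

    # Step 3: merge, de-duplicate, and re-rank
    combined = list(set(graph_context + vector_context))
    final_context = rerank_contexts(resolved_query, combined)

    # Step 4: build the prompt and call the LLM (generate_response is a placeholder)
    prompt = build_prompt(resolved_query, entities, final_context)
    response_text = generate_response(prompt)

    # Step 5: close the loop by writing the turn back into the graph
    update_graph_with_turn(driver, session_id, resolved_query, entities,
                           response_text, final_context)
    return response_text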
Advanced Considerations and Production Edge Cases
Deploying this architecture requires handling several complex scenarios.
1. Context Pruning and Summarization
A long-running conversation can lead to a massive graph, slowing down queries. We need strategies to manage its size:
* Time-based Windowing: In the retrieval query, limit how far back you traverse, e.g., with a bounded variable-length pattern along the temporal chain such as (t:Turn)-[:NEXT_TURN*0..10]->(latest:Turn), so only the last 10 turns are considered.
* Entity Salience Scoring: Assign a salience score to entities based on frequency and recency of mentions. Deprioritize or ignore low-salience entities during retrieval.
* Conversational Summarization: For very long conversations, periodically use an LLM to summarize a branch of the conversation. Create a Summary node that condenses a series of Turn nodes, and link it to the key entities from that segment. The retrieval query can then pull from these summaries instead of traversing every single turn.
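A sketch of the summarization idea follows. The Summary node label comes from the point above; the HAS_SUMMARY and SUMMARIZES relationship names, the timestamp cutoff, and the idea that the summary text is produced by an upstream LLM call are all assumptions.
def summarize_old_turns(driver, session_id: str, cutoff_ts: int, summary_text: str):
    """Condenses all turns older than cutoff_ts into a single Summary node."""
    with driver.session() as session:
        session.run("""
            MATCH (s:Session {id: $session_id})-[:HAS_TURN]->(t:Turn)
            WHERE t.timestamp < $cutoff_ts
            MERGE (sum:Summary {session_id: $session_id, text: $summary_text})
            MERGE (s)-[:HAS_SUMMARY]->(sum)
            MERGE (sum)-[:SUMMARIZES]->(t)
            // Surface the key entities of the summarized segment on the Summary node
            WITH sum, t
            MATCH (t)-[:HAS_QUERY|HAS_RESPONSE]->()-[:MENTIONS]->(e:Entity)
            MERGE (sum)-[:MENTIONS]->(e)
        """, session_id=session_id, cutoff_ts=cutoff_ts, summary_text=summary_text)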
2. Ambiguity Resolution
What if the user mentions "Apple"? The company or the fruit? The graph provides the context needed to disambiguate.
Modify the retrieval query to check the neighborhood of candidate entities. If a query mentions "Apple" and "iPhone", the query can favor the Entity node for 'Apple' that is already RELATED_TO the Entity for 'iPhone'.
// Cypher snippet for disambiguation
// Consider every candidate entity named "Apple"
MATCH (e:Entity {name: "Apple"})
// Count how many other entities from the query each candidate is already connected to
OPTIONAL MATCH (e)-[:RELATED_TO]-(other_e:Entity)
WHERE other_e.name IN ["iPhone", "Tim Cook"]
WITH e, count(other_e) AS score
// Prefer the candidate most connected to the query's other entities
RETURN e ORDER BY score DESC LIMIT 1
3. Handling Topic Drift
Users often change subjects abruptly. Our system should recognize this. We can detect a topic drift by analyzing the overlap of entities between the current turn and the previous N turns. If the Jaccard similarity of the entity sets is below a certain threshold, we can infer a topic change.
Upon detection, we can either:
a) Start a new "branch" in the graph, creating a TopicSegment node.
b) Heavily down-weight the context from the previous topic during retrieval.
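Here is a minimal sketch of the drift check described above, comparing the current turn's entities against those gathered from the previous few turns (the 0.2 threshold is an arbitrary illustration to be tuned empirically):
def detect_topic_drift(current_entities: list[str], previous_entities: list[str],
                       threshold: float = 0.2) -> bool:
    """Returns True when entity overlap with recent turns falls below the threshold."""
    current, previous = set(current_entities), set(previous_entities)
    if not current or not previous:
        return False  # Not enough signal to call it a drift
    jaccard = len(current & previous) / len(current | previous)
    return jaccard < threshold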
4. Performance and Scalability
* Query Optimization: Every Cypher query must be profiled (PROFILE or EXPLAIN). Ensure proper indexes are created on key node properties, especially Session(id) and Entity(name); Neo4j indexes are always scoped to a label plus one or more properties, and a composite index over multiple properties can help for more complex lookups.
* Write Latency: The graph update in Step 5 adds latency to each turn. For applications requiring real-time responses, this write can be performed asynchronously. The user gets their response immediately, and the graph is updated in the background. The trade-off is that the very next, immediate query might not have access to the absolute latest turn's context.
* Caching: Implement a caching layer (like Redis) for frequently accessed data, such as the entity list for active sessions. This can reduce read load on the graph database.
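As an example of the caching point, the recent-entity lookup from Step 1 can be fronted by Redis. A sketch using redis-py; the key naming scheme and TTL are assumptions, and fetch_recent_entities_from_graph is a hypothetical wrapper around the Cypher query shown in Step 1.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_recent_entities_cached(graph_driver, session_id: str, ttl_seconds: int = 60) -> list[str]:
    """Serves the per-session entity list from Redis, falling back to Neo4j."""
    key = f"session:{session_id}:recent_entities"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Hypothetical helper wrapping the recent-entity Cypher query from Step 1
    entities = fetch_recent_entities_from_graph(graph_driver, session_id)
    cache.setex(key, ttl_seconds, json.dumps(entities))
    return entities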
Conclusion: Beyond Stateless RAG
Building a stateful RAG system is a significant architectural undertaking. It moves beyond simple vector search and demands a more holistic approach to context management. By modeling conversations as a dynamic, evolving knowledge graph, we create a persistent memory that enables LLMs to engage in truly coherent, multi-turn dialogue.
The patterns discussed here—hybrid retrieval, coreference resolution using graph context, and transactional state updates—provide a blueprint for the next generation of conversational AI. While the implementation complexity is higher than stateless RAG, the payoff is an AI agent that can remember, reason about, and reference past interactions, transforming it from a simple information retrieval tool into a genuine conversational partner.