Stateful RAG: Multi-Turn Conversation with Graph Databases
The Achilles' Heel of Standard RAG: Conversational Amnesia
Retrieval-Augmented Generation (RAG) has become the de facto standard for grounding Large Language Models (LLMs) in factual, external knowledge. The canonical implementation is brutally effective yet fundamentally stateless: a user query is converted to an embedding, a vector database retrieves semantically similar document chunks, and these chunks are prepended to the prompt as context for the LLM. This works beautifully for single-shot, atomic questions.
However, the moment a user asks a follow-up question—"What about its performance in Europe?"—the stateless model collapses. The pronoun "its" is meaningless without the context of the previous turn. The standard solution, a sliding window of chat history, is a crude patch. It crams unstructured text into the context window, consuming precious tokens, failing to distinguish key entities from conversational filler, and offering no mechanism for long-term memory or recalling information from much earlier in the conversation.
To build sophisticated conversational agents, we must solve the problem of conversational state. This requires a system that not only remembers what was said but understands the relationships between the entities discussed. This is not a job for a key-value store or a relational database; it is a quintessential graph problem. This article details a production-grade architecture for building a stateful, multi-turn RAG system using a graph database (Neo4j) to create a dynamic, persistent model of the conversation's context.
The Graph-Based Context Model: From Chat Log to Knowledge Graph
Instead of viewing a conversation as a linear sequence of messages, we will model it as an evolving graph of interconnected entities and concepts. This paradigm shift is the core of our stateful architecture.
Why a Graph Database?
Graph databases like Neo4j are purpose-built to store and query highly connected data. In our conversational context, this means:
* Entities are first-class citizens: Key concepts mentioned in the conversation are stored as Entity nodes. This gives them a distinct, addressable identity.
* Relationships carry the meaning: A Query node can MENTION an Entity node, and two Entity nodes can be RELATED_TO each other.
* Traversal is native: A relational database would need expensive JOIN operations to approximate this, while a graph database handles it natively and efficiently.
Designing the Conversational Graph Schema
A robust schema is critical. We will define a set of node labels and relationship types to structure our conversational memory:
Node Labels:
* Session: Represents a single, continuous conversation with a user. It acts as the entry point for all queries related to that conversation.
* Turn: Represents a single user query and the LLM's response. It's a container for a request/response pair.
* Query: The user's specific utterance.
* Response: The LLM's generated answer.
* Entity: A named entity (e.g., 'Kubernetes', 'Project Titan', 'Q4-2023-report.pdf') extracted from queries or responses.
* DocumentChunk: A node representing a specific chunk of text from our knowledge base, linked to its vector embedding ID.
Relationship Types:
* HAS_TURN: (Session)-[:HAS_TURN]->(Turn)
* NEXT_TURN: (Turn)-[:NEXT_TURN]->(Turn) (Creates a temporal chain)
* HAS_QUERY: (Turn)-[:HAS_QUERY]->(Query)
* HAS_RESPONSE: (Turn)-[:HAS_RESPONSE]->(Response)
* MENTIONS: (Query)-[:MENTIONS]->(Entity), (Response)-[:MENTIONS]->(Entity)
* SOURCED_FROM: (Response)-[:SOURCED_FROM]->(DocumentChunk)
* RELATED_TO: (Entity)-[:RELATED_TO]->(Entity) (Can be populated offline or inferred during conversation)
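With the schema settled, it helps to declare the supporting constraint and indexes up front (the performance section later returns to why). Below is a minimal sketch using the official neo4j Python driver; the connection details and the Neo4j 5 index syntax are assumptions, and Entity(name) gets a plain index rather than a uniqueness constraint so that ambiguous names (see the disambiguation section) remain possible.
from neo4j import GraphDatabase

# Connection details are placeholders for your own deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SCHEMA_STATEMENTS = [
    # One Session node per conversation id
    "CREATE CONSTRAINT session_id IF NOT EXISTS FOR (s:Session) REQUIRE s.id IS UNIQUE",
    # Fast lookups for entity names and chunk text (plain indexes, since
    # several nodes may legitimately share a name, e.g. 'Apple')
    "CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name)",
    "CREATE INDEX chunk_text IF NOT EXISTS FOR (c:DocumentChunk) ON (c.text)",
]

def create_schema(driver):
    """Creates the constraint and indexes the conversational graph relies on."""
    with driver.session() as session:
        for statement in SCHEMA_STATEMENTS:
            session.run(statement)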
Here is a visualization of how a few turns of conversation might be represented:
graph TD
    subgraph Session_Graph["Conversation Session: sess_123"]
        S(Session) -- HAS_TURN --> T1(Turn 1)
        T1 -- NEXT_TURN --> T2(Turn 2)
        T1 -- HAS_QUERY --> Q1("Query: Tell me about Project X")
        T1 -- HAS_RESPONSE --> R1("Response: Project X is a... It uses Kubernetes.")
        T2 -- HAS_QUERY --> Q2("Query: How does it scale?")
        T2 -- HAS_RESPONSE --> R2("Response: It scales using HPA...")
    end
    subgraph Knowledge_Graph["Knowledge Graph"]
        E_ProjX("Entity: Project X")
        E_K8s("Entity: Kubernetes")
        E_HPA("Entity: HPA")
    end
    Q1 -- MENTIONS --> E_ProjX
    R1 -- MENTIONS --> E_ProjX
    R1 -- MENTIONS --> E_K8s
    %% Coreference resolution links "it" to Project X
    Q2 -- MENTIONS --> E_ProjX
    R2 -- MENTIONS --> E_HPA
    E_K8s -- RELATED_TO --> E_HPA
This model transforms a flat chat history into a rich, queryable knowledge graph specific to the conversation, enabling us to retrieve highly relevant, structured context.
The Stateful RAG Loop: A Five-Step Implementation
Let's walk through the full lifecycle of a single conversational turn in our stateful system. We'll use Python, the neo4j driver, and a hypothetical vector DB client.
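One prerequisite the five steps below take for granted: the Session node must exist before the first turn is written. A minimal bootstrap sketch (the UUID-based id scheme is an assumption):
import uuid

def start_session(driver) -> str:
    """Creates the Session node for a new conversation and returns its id."""
    session_id = str(uuid.uuid4())
    with driver.session() as neo_session:
        neo_session.run(
            "MERGE (s:Session {id: $session_id}) "
            "ON CREATE SET s.created_at = timestamp()",
            session_id=session_id,
        )
    return session_id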
Step 1: Ingestion and Coreference Resolution
When a user query arrives, the first step is to understand what it's about. This involves more than just embedding the raw string.
import spacy
# Load a transformer-based spaCy pipeline for named entity recognition
# (coreference itself is resolved via the LLM call below)
nlp = spacy.load("en_core_web_trf")
def process_user_query(session_id: str, query_text: str, graph_driver):
"""Extracts entities and resolves coreferences using graph context."""
# 1. Get recent entities from the graph for context
with graph_driver.session() as session:
result = session.run("""
MATCH (s:Session {id: $session_id})-[:HAS_TURN]->(t:Turn)
-[:HAS_QUERY|HAS_RESPONSE]->(m)-[:MENTIONS]->(e:Entity)
WITH e, max(t.timestamp) AS last_mentioned ORDER BY last_mentioned DESC
RETURN e.name AS entity_name
LIMIT 10
""", session_id=session_id)
recent_entities = [record["entity_name"] for record in result]
# This is a simplified example. A real implementation might use a more
# sophisticated prompt or model for coreference resolution.
# The key is feeding the model the list of candidate entities.
contextual_query = f"""
Recent entities mentioned: {', '.join(recent_entities)}.
Resolve pronouns in the following user query: '{query_text}'
"""
# In a real app, you would send this to an LLM to get a resolved query.
# For this example, let's assume a simple replacement.
# e.g., if recent_entities included 'Project Titan' and query was 'How does it scale?',
# the resolved query would be 'How does Project Titan scale?'
resolved_query_text = resolve_coreferences_with_llm(contextual_query) # Placeholder for LLM call
doc = nlp(resolved_query_text)
entities = [ent.text for ent in doc.ents]
return resolved_query_text, entities
# Placeholder for a function that would call an LLM
def resolve_coreferences_with_llm(prompt: str) -> str:
# In a real implementation, you'd use OpenAI, Anthropic, etc.
# to process the prompt and return the resolved query text.
# For now, we'll just return the original query part.
return prompt.split("user query: '")[1].split("'")[0]
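For completeness, here is one way that placeholder might be filled in, sketched against the OpenAI chat completions API; the model name and system instructions are illustrative assumptions, and any chat-capable LLM would work.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def resolve_coreferences_with_llm(prompt: str) -> str:
    """Asks an LLM to rewrite the query with pronouns resolved against recent entities."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Illustrative model choice
        messages=[
            {"role": "system", "content": "Rewrite the user query with all pronouns "
             "replaced by the entities they refer to. Return only the rewritten query."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()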
Step 2: Hybrid Context Retrieval (Graph + Vector)
Now we retrieve context using a two-pronged approach:
* Graph Retrieval (Explicit Context): We query the graph to find entities, documents, and past turns that are explicitly linked to the entities in our current query. This provides precise, structured context.
* Vector Retrieval (Semantic Context): We embed the resolved_query_text and perform a similarity search in our vector database. This catches related concepts that aren't directly linked in the graph.
Here's the Cypher query for graph retrieval. It starts from the current session, finds entities mentioned in the new query, and then traverses outwards to find related entities and the document chunks where they were originally mentioned.
def retrieve_graph_context(graph_driver, session_id: str, entities: list[str]):
cypher_query = """
// Find entities mentioned in the current query
MATCH (e:Entity) WHERE e.name IN $entities
WITH collect(e) as current_entities
// Find other entities related to the current ones
UNWIND current_entities AS e
OPTIONAL MATCH (e)-[:RELATED_TO]-(related_e:Entity)
WITH current_entities + collect(related_e) as all_relevant_entities
UNWIND all_relevant_entities as relevant_entity
// Find document chunks where these relevant entities were mentioned
// and prioritize chunks already surfaced in turns of this session.
MATCH (chunk:DocumentChunk)<-[:SOURCED_FROM]-(resp:Response)<-[:HAS_RESPONSE]-(t:Turn)
      <-[:HAS_TURN]-(s:Session {id: $session_id})
WHERE (resp)-[:MENTIONS]->(relevant_entity)
RETURN DISTINCT chunk.text as text, 1.0 AS score // Give high score to direct graph hits
LIMIT 5
UNION
// Fallback: find chunks related to the entities, not tied to this session
MATCH (e:Entity)<-[:MENTIONS]-(resp:Response)-[:SOURCED_FROM]->(chunk:DocumentChunk)
WHERE e.name IN $entities
RETURN DISTINCT chunk.text as text, 0.5 AS score
LIMIT 5
"""
with graph_driver.session() as session:
result = session.run(cypher_query, entities=entities, session_id=session_id)
return [record["text"] for record in result]
This query prioritizes information that has already been relevant in this specific conversation before falling back to general knowledge, much like human conversational recall favors recently discussed topics.
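The vector prong is left abstract in this article; a minimal sketch follows, assuming a sentence-transformers embedding model and a generic vector_db client whose query method returns matching chunks. Both the client and its interface are hypothetical stand-ins for whatever vector store you use (Pinecone, Weaviate, Qdrant, etc.).
from sentence_transformers import SentenceTransformer

# Loaded once at startup; the model choice is illustrative.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_vector_context(vector_db, query_text: str, top_k: int = 5) -> list[str]:
    """Embeds the resolved query and returns the top-k semantically similar chunks."""
    query_embedding = embedding_model.encode(query_text).tolist()
    # `vector_db.query` is a stand-in for your vector store's similarity search call.
    results = vector_db.query(vector=query_embedding, top_k=top_k)
    return [match["text"] for match in results]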
Step 3: Merging and Re-ranking Contexts
We now have two sets of context chunks: one from the graph and one from the vector DB. Simply concatenating them is suboptimal. A better approach is to use a re-ranking model (e.g., a cross-encoder) to score all retrieved chunks against the user's query and select the top-k most relevant results from the combined pool. This ensures the final context is dense with relevant information.
from sentence_transformers.cross_encoder import CrossEncoder
# This would be loaded once in your application
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank_contexts(query: str, contexts: list[str]):
pairs = [[query, ctx] for ctx in contexts]
scores = cross_encoder.predict(pairs)
scored_contexts = zip(scores, contexts)
scored_contexts = sorted(scored_contexts, key=lambda x: x[0], reverse=True)
# Return the top N contexts
return [ctx for score, ctx in scored_contexts[:5]]
# In our main flow:
graph_context = retrieve_graph_context(...)
vector_context = retrieve_vector_context(...) # Sketched at the end of Step 2
combined_context = list(set(graph_context + vector_context)) # De-duplicate
final_context = rerank_contexts(resolved_query_text, combined_context)
Step 4: Prompt Engineering and LLM Invocation
We construct a prompt that leverages our hard-won context. The structure should clearly delineate the different types of information.
SYSTEM: You are a helpful AI assistant with a persistent memory of this conversation.
CONVERSATIONAL CONTEXT GRAPH:
The user has previously discussed these key entities: ["Project Titan", "Kubernetes"]. The last response was about Horizontal Pod Autoscaling (HPA).
RETRIEVED KNOWLEDGE:
---
Context 1: [Text of the first re-ranked document chunk]
---
Context 2: [Text of the second re-ranked document chunk]
---
USER QUERY:
{resolved_query_text}
ASSISTANT RESPONSE:
This structured prompt helps the LLM differentiate between conversational history and retrieved documents, leading to more accurate and contextually-aware responses.
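Assembling that prompt is plain string templating. A sketch of one possible helper (the exact wording of the system message and section headers mirrors the template above; the function name is an assumption):
def build_prompt(resolved_query_text: str, recent_entities: list[str],
                 final_context: list[str]) -> str:
    """Assembles the structured prompt from graph state and re-ranked chunks."""
    context_blocks = "\n---\n".join(
        f"Context {i + 1}: {chunk}" for i, chunk in enumerate(final_context)
    )
    return (
        "SYSTEM: You are a helpful AI assistant with a persistent memory of this conversation.\n\n"
        "CONVERSATIONAL CONTEXT GRAPH:\n"
        f"The user has previously discussed these key entities: {recent_entities}.\n\n"
        "RETRIEVED KNOWLEDGE:\n---\n"
        f"{context_blocks}\n---\n\n"
        "USER QUERY:\n"
        f"{resolved_query_text}\n\n"
        "ASSISTANT RESPONSE:"
    )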
Step 5: Updating the Graph (Closing the Loop)
This final step is what makes the system stateful. After the LLM generates a response, we parse it and update our graph to incorporate the new information from this turn.
This must be done in a single atomic transaction to ensure data integrity.
def update_graph_with_turn(driver, session_id: str, query_text: str, query_entities: list, response_text: str, used_chunks: list):
# First, extract entities from the LLM's response
doc = nlp(response_text)
response_entities = [ent.text for ent in doc.ents]
all_entities = list(set(query_entities + response_entities))
cypher_transaction = """
    // Ensure all entities exist, then collapse back to a single row so the
    // following CREATE clauses run exactly once (not once per entity)
    UNWIND $all_entities AS entity_name
    MERGE (e:Entity {name: entity_name})
    WITH collect(e) AS ensured_entities
    // Find the session and its current last turn (the turn with no outgoing NEXT_TURN)
    MATCH (s:Session {id: $session_id})
    OPTIONAL MATCH (s)-[:HAS_TURN]->(last_turn:Turn)
    WHERE NOT (last_turn)-[:NEXT_TURN]->()
    // Create the new turn, query, and response nodes
    CREATE (t:Turn {timestamp: timestamp()})
    CREATE (q:Query {text: $query_text})
    CREATE (r:Response {text: $response_text})
    // Link them together
    CREATE (s)-[:HAS_TURN]->(t)
    CREATE (t)-[:HAS_QUERY]->(q)
    CREATE (t)-[:HAS_RESPONSE]->(r)
    // If there was a last turn, chain it to the new one
    FOREACH (lt IN CASE WHEN last_turn IS NULL THEN [] ELSE [last_turn] END |
        CREATE (lt)-[:NEXT_TURN]->(t)
    )
    // Link the query and response to their entities. FOREACH keeps the row count
    // stable and is simply a no-op for empty entity lists.
    FOREACH (entity_name IN $query_entities |
        MERGE (qe:Entity {name: entity_name})
        MERGE (q)-[:MENTIONS]->(qe)
    )
    FOREACH (entity_name IN $response_entities |
        MERGE (re:Entity {name: entity_name})
        MERGE (r)-[:MENTIONS]->(re)
    )
    // Link the response to the document chunks it was sourced from
    WITH r
    UNWIND $used_chunks AS chunk_text
    MATCH (c:DocumentChunk {text: chunk_text})
    MERGE (r)-[:SOURCED_FROM]->(c)
"""
with driver.session() as session:
session.execute_write(lambda tx: tx.run(cypher_transaction,
session_id=session_id,
query_text=query_text,
response_text=response_text,
query_entities=query_entities,
response_entities=response_entities,
all_entities=all_entities,
used_chunks=used_chunks
).consume())
This transactional update ensures that each turn of the conversation is atomically added to the graph, growing the conversational memory and making it available for subsequent turns.
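Tying the five steps together, a single pass through the loop might look like the sketch below. It reuses the functions defined above plus build_prompt from the Step 4 sketch; generate_response is a placeholder for whatever LLM call you use.
def handle_turn(driver, vector_db, session_id: str, user_query: str) -> str:
    # Step 1: resolve coreferences and extract entities
    resolved_query, entities = process_user_query(session_id, user_query, driver)

    # Step 2: hybrid retrieval (graph + vector)
    graph_context = retrieve_graph_context(driver, session_id, entities)
    vector_context = retrieve_vector_context(vector_db, resolved_query)

    # Step 3: merge, de-duplicate, and re-rank
    combined = list(set(graph_context + vector_context))
    final_context = rerank_contexts(resolved_query, combined)

    # Step 4: build the prompt and call the LLM (generate_response is a placeholder)
    prompt = build_prompt(resolved_query, entities, final_context)
    response_text = generate_response(prompt)

    # Step 5: close the loop by writing the turn back into the graph
    update_graph_with_turn(driver, session_id, resolved_query, entities,
                           response_text, final_context)
    return response_text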
Advanced Considerations and Production Edge Cases
Deploying this architecture requires handling several complex scenarios.
1. Context Pruning and Summarization
A long-running conversation can lead to a massive graph, slowing down queries. We need strategies to manage its size:
* Time-based Windowing: In the retrieval query, limit how far back you traverse, e.g., with a bounded variable-length pattern along the temporal chain such as (t:Turn)-[:NEXT_TURN*0..10]->(latest:Turn), so only the last 10 turns are considered.
* Entity Salience Scoring: Assign a salience score to entities based on frequency and recency of mentions. Deprioritize or ignore low-salience entities during retrieval.
* Conversational Summarization: For very long conversations, periodically use an LLM to summarize a branch of the conversation. Create a Summary node that condenses a series of Turn nodes, and link it to the key entities from that segment. The retrieval query can then pull from these summaries instead of traversing every single turn.
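A sketch of the summarization idea follows. The Summary node label comes from the point above; the HAS_SUMMARY and SUMMARIZES relationship names, the timestamp cutoff, and the idea that the summary text is produced by an upstream LLM call are all assumptions.
def summarize_old_turns(driver, session_id: str, cutoff_ts: int, summary_text: str):
    """Condenses all turns older than cutoff_ts into a single Summary node."""
    with driver.session() as session:
        session.run("""
            MATCH (s:Session {id: $session_id})-[:HAS_TURN]->(t:Turn)
            WHERE t.timestamp < $cutoff_ts
            MERGE (sum:Summary {session_id: $session_id, text: $summary_text})
            MERGE (s)-[:HAS_SUMMARY]->(sum)
            MERGE (sum)-[:SUMMARIZES]->(t)
            // Surface the key entities of the summarized segment on the Summary node
            WITH sum, t
            MATCH (t)-[:HAS_QUERY|HAS_RESPONSE]->()-[:MENTIONS]->(e:Entity)
            MERGE (sum)-[:MENTIONS]->(e)
        """, session_id=session_id, cutoff_ts=cutoff_ts, summary_text=summary_text)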
2. Ambiguity Resolution
What if the user mentions "Apple"? The company or the fruit? The graph provides the context needed to disambiguate.
Modify the retrieval query to check the neighborhood of candidate entities. If a query mentions "Apple" and "iPhone", the query can favor the Entity node for 'Apple' that is already RELATED_TO the Entity for 'iPhone'.
// Cypher snippet for disambiguation
// Consider every candidate entity named "Apple"
MATCH (e:Entity {name: "Apple"})
// Count how many other entities from the query each candidate is already connected to
OPTIONAL MATCH (e)-[:RELATED_TO]-(other_e:Entity)
WHERE other_e.name IN ["iPhone", "Tim Cook"]
WITH e, count(other_e) AS score
// Prefer the candidate most connected to the query's other entities
RETURN e ORDER BY score DESC LIMIT 1
3. Handling Topic Drift
Users often change subjects abruptly. Our system should recognize this. We can detect a topic drift by analyzing the overlap of entities between the current turn and the previous N turns. If the Jaccard similarity of the entity sets is below a certain threshold, we can infer a topic change.
Upon detection, we can either:
a) Start a new "branch" in the graph, creating a TopicSegment node.
b) Heavily down-weight the context from the previous topic during retrieval.
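Here is a minimal sketch of the drift check described above, comparing the current turn's entities against those gathered from the previous few turns (the 0.2 threshold is an arbitrary illustration to be tuned empirically):
def detect_topic_drift(current_entities: list[str], previous_entities: list[str],
                       threshold: float = 0.2) -> bool:
    """Returns True when entity overlap with recent turns falls below the threshold."""
    current, previous = set(current_entities), set(previous_entities)
    if not current or not previous:
        return False  # Not enough signal to call it a drift
    jaccard = len(current & previous) / len(current | previous)
    return jaccard < threshold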
4. Performance and Scalability
* Query Optimization: Every Cypher query must be profiled (PROFILE or EXPLAIN). Ensure proper indexes are created on key node properties, especially Session(id) and Entity(name); Neo4j indexes are always scoped to a label plus one or more properties, and a composite index over multiple properties can help for more complex lookups.
* Write Latency: The graph update in Step 5 adds latency to each turn. For applications requiring real-time responses, this write can be performed asynchronously. The user gets their response immediately, and the graph is updated in the background. The trade-off is that the very next, immediate query might not have access to the absolute latest turn's context.
* Caching: Implement a caching layer (like Redis) for frequently accessed data, such as the entity list for active sessions. This can reduce read load on the graph database.
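As an example of the caching point, the recent-entity lookup from Step 1 can be fronted by Redis. A sketch using redis-py; the key naming scheme and TTL are assumptions, and fetch_recent_entities_from_graph is a hypothetical wrapper around the Cypher query shown in Step 1.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_recent_entities_cached(graph_driver, session_id: str, ttl_seconds: int = 60) -> list[str]:
    """Serves the per-session entity list from Redis, falling back to Neo4j."""
    key = f"session:{session_id}:recent_entities"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Hypothetical helper wrapping the recent-entity Cypher query from Step 1
    entities = fetch_recent_entities_from_graph(graph_driver, session_id)
    cache.setex(key, ttl_seconds, json.dumps(entities))
    return entities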
Conclusion: Beyond Stateless RAG
Building a stateful RAG system is a significant architectural undertaking. It moves beyond simple vector search and demands a more holistic approach to context management. By modeling conversations as a dynamic, evolving knowledge graph, we create a persistent memory that enables LLMs to engage in truly coherent, multi-turn dialogue.
The patterns discussed here—hybrid retrieval, coreference resolution using graph context, and transactional state updates—provide a blueprint for the next generation of conversational AI. While the implementation complexity is higher than stateless RAG, the payoff is an AI agent that can remember, reason about, and reference past interactions, transforming it from a simple information retrieval tool into a genuine conversational partner.