Graph Reranking in Hybrid Search for Multi-Hop RAG Systems

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Architectural Ceiling of Naive RAG in Complex QA

As senior engineers, we've moved beyond the 'Hello, World' of Retrieval-Augmented Generation (RAG). We understand that embedding a corpus and performing cosine similarity searches is table stakes. The real challenge—the one that separates proof-of-concepts from production-grade AI systems—is handling queries that require reasoning and synthesis across multiple information sources. These are often called multi-hop or compositional queries.

Consider this query: "What was the market cap of the company that acquired the startup founded by the creator of the 'Chroma' data visualization library?"

A standard RAG system will likely fail here. It might embed the entire query and find documents semantically similar to "market cap," "company acquisitions," or "Chroma library." However, it has no intrinsic mechanism to follow the logical chain:

  • Find the document about the 'Chroma' library to identify its creator.
  • Find documents about that creator to identify the startup they founded.
  • Find documents about that startup to identify the acquiring company.
  • Find financial documents about the acquiring company to get its market cap.

Naive semantic search retrieves a flat list of documents, often missing crucial intermediate links. It optimizes for semantic relevance to the query as a whole, not for the relational path required to construct the answer. This is the architectural ceiling we need to break.

This article details a production-proven, multi-stage pattern that explicitly addresses this challenge: Hybrid Search followed by Dynamic Graph-based Reranking. We will not just retrieve documents; we will build a transient relational model of the retrieved information space to identify the most salient context for the LLM.

The Multi-Stage Retrieval Architecture

Our advanced RAG pipeline will consist of three distinct phases:

  • Broad Retrieval (Recall-Optimized): Use a hybrid search approach combining keyword-based (BM25) and vector-based (Dense) search to cast a wide but relevant net. This ensures we capture both lexical matches (specific names like 'Chroma') and semantic concepts.
  • Graph Reranking (Precision-Optimized): This is the core of our solution. We take the top N candidates from Phase 1, dynamically extract entities and relationships from them, and build an in-memory knowledge graph. We then use graph centrality algorithms to rerank the documents based on how well they connect the key entities in the query.
  • Synthesis (LLM Augmentation): Provide the reranked, highly-relevant, and logically ordered documents to the LLM to generate the final, faithful answer.
Let's dive into the implementation details.
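
Before we do, here is a minimal sketch of how the three phases compose end to end. The functions it calls (hybrid_search, build_knowledge_graph, rerank_docs_with_graph, and generate_final_answer) are built step by step in the sections below, so treat this as a roadmap rather than a drop-in implementation.

python
def answer_multi_hop_query(query: str) -> str:
    # Phase 1: recall-optimized hybrid retrieval (BM25 + dense vectors)
    candidates = hybrid_search(query, k=50, alpha=0.5)

    # Phase 2: build a transient knowledge graph over the candidates and
    # rerank them by the centrality of the entities they contain
    graph, doc_map = build_knowledge_graph(candidates)
    reranked = rerank_docs_with_graph(candidates, graph, doc_map, query)

    # Phase 3: synthesize the final answer from the top reranked documents
    return generate_final_answer(query, reranked, top_k=5)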


    Phase 1: Broad Retrieval with Hybrid Search

    Starting with a solid foundation is critical. Relying solely on dense vector search is a common pitfall. It struggles with specific keywords, acronyms, and out-of-domain terms. Hybrid search provides a robust solution.

    We'll use the Weaviate vector database for this example, as it has excellent native support for hybrid search. Assume we have a collection of documents (news articles, financial reports, technical blogs) ingested into a Weaviate class named Article.

    Weaviate Schema and Data Ingestion

    First, ensure your Weaviate schema is configured for both dense vectors (using a transformer model) and keyword indexing.

    python
    import weaviate
    import weaviate.classes.config as wvc
    import os
    
    # Connect to Weaviate instance
    client = weaviate.connect_to_local(
        headers={"X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY")}
    )
    
    # Define the collection schema
    if client.collections.exists("Article"):
        client.collections.delete("Article")
    
    articles = client.collections.create(
        name="Article",
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.Configure.Generative.openai(),
        properties=[
            wvc.Property(name="title", data_type=wvc.DataType.TEXT),
            wvc.Property(name="content", data_type=wvc.DataType.TEXT),
            wvc.Property(name="source", data_type=wvc.DataType.TEXT),
        ]
    )
    
    # Example data ingestion (in a real scenario, this would be a large corpus)
    docs = [
        {"title": "Chroma.js Creator John Doe Launches 'GraphiQL' Startup", "content": "John Doe, the creator of the popular open-source data visualization library Chroma.js, has announced the launch of his new startup, 'GraphiQL'. The company aims to build developer tools for GraphQL APIs.", "source": "TechCrunch_2021"},
        {"title": "InnovateCorp Acquires GraphQL Tooling Startup GraphiQL for $500M", "content": "Enterprise software giant InnovateCorp today announced its acquisition of GraphiQL, a promising startup in the GraphQL developer space. The deal is valued at approximately $500 million.", "source": "Reuters_2022"},
        {"title": "InnovateCorp (INVC) Reports Strong Q4 Earnings, Market Cap Soars to $150B", "content": "InnovateCorp's latest earnings report exceeded analyst expectations, driving its stock price up. The company's market capitalization now stands at a robust $150 billion.", "source": "Bloomberg_2023"},
        {"title": "A Deep Dive into Data Visualization with Chroma.js", "content": "Chroma.js offers a powerful and flexible API for color manipulation and visualization. Created by John Doe, it has become a staple for web developers.", "source": "SmashingMagazine_2020"},
        # ... add more distractor documents
        {"title": "The Rise of Vector Databases", "content": "Vector databases are becoming essential for AI applications, especially in semantic search and RAG.", "source": "AI_Weekly_2023"}
    ]
    
    articles.data.insert_many(docs)
    
    print("Data ingestion complete.")

    Implementing the Hybrid Search Function

    Now, we'll write a function to perform the hybrid search. We'll retrieve a relatively large number of candidates (e.g., k=50) to ensure our graph construction phase has enough material to work with. The alpha parameter balances keyword and vector scoring: alpha=0.0 is pure BM25 keyword search, alpha=1.0 is pure vector search, and alpha=0.5 gives equal weight to both.

    python
    from typing import List, Dict
    
    def hybrid_search(query: str, k: int = 50, alpha: float = 0.5) -> List[Dict]:
        """
        Performs a hybrid search on the 'Article' collection.
    
        Args:
            query (str): The user's query.
            k (int): The number of documents to retrieve.
            alpha (float): The weighting factor for hybrid search.
                           0.0 is pure BM25 (keyword), 1.0 is pure vector.
    
        Returns:
            List[Dict]: A list of retrieved document objects.
        """
        articles = client.collections.get("Article")
        
        response = articles.query.hybrid(
            query=query,
            limit=k,
            alpha=alpha,
            query_properties=["title", "content"]
        )
        
        retrieved_docs = []
        for item in response.objects:
            retrieved_docs.append({
                "id": str(item.uuid),
                "title": item.properties["title"],
                "content": item.properties["content"],
                "source": item.properties["source"],
            })
        
        return retrieved_docs
    
    # Example usage for our multi-hop query
    multi_hop_query = "What was the market cap of the company that acquired the startup founded by the creator of the 'Chroma' data visualization library?"
    
    candidate_docs = hybrid_search(multi_hop_query)
    
    print(f"Retrieved {len(candidate_docs)} candidate documents for graph construction.")
    # This will retrieve our 3 key documents plus many other distractor docs.

    At this point, we have a list of candidate documents. A naive RAG system would simply concatenate their content and pass it to an LLM. We, however, will proceed to the crucial reranking phase.


    Phase 2: Dynamic Graph Construction & Reranking

    This phase is where we build our transient analytical model. The hypothesis is that documents forming a connected path between the query's key entities are more important than documents that are merely semantically similar in isolation.

    2a. On-the-Fly Entity and Relationship Extraction

    We need to process the text of our candidate_docs to extract structured information. We'll use an LLM for this task, leveraging function calling for reliable, structured output. A fine-tuned, smaller open-source model could be a cost-effective alternative in production.

    First, let's define our data structures using Pydantic for validation.

    python
    from pydantic import BaseModel, Field
    from typing import List, Optional
    
    class Entity(BaseModel):
        name: str = Field(description="The extracted entity name, e.g., 'John Doe', 'InnovateCorp'")
        type: str = Field(description="The entity type, e.g., 'PERSON', 'ORGANIZATION', 'PRODUCT', 'TECHNOLOGY'")
    
    class Relationship(BaseModel):
        source: str = Field(description="The name of the source entity")
        target: str = Field(description="The name of the target entity")
        type: str = Field(description="The type of relationship, e.g., 'FOUNDED', 'ACQUIRED_BY', 'CREATOR_OF'")
    
    class ExtractedGraph(BaseModel):
        entities: List[Entity]
        relationships: List[Relationship]

    Now, we create a function that takes a document's text and uses an LLM to extract the graph structure.

    python
    import instructor
    from openai import OpenAI
    
    # Patch the OpenAI client with instructor for structured output
    client_instructor = instructor.patch(OpenAI(api_key=os.getenv("OPENAI_APIKEY")))
    
    def extract_graph_from_text(text: str) -> Optional[ExtractedGraph]:
        """
        Uses an LLM with function calling to extract entities and relationships from text.
        """
        try:
            graph = client_instructor.chat.completions.create(
                model="gpt-4-turbo-preview",
                response_model=ExtractedGraph,
                messages=[
                    {
                        "role": "system",
                        "content": "You are an expert entity and relationship extraction system. Extract all relevant entities (people, organizations, products, technologies) and the relationships between them from the provided text. The relationships should be directed, like 'CREATOR_OF', 'FOUNDED', 'ACQUIRED_BY'."
                    },
                    {
                        "role": "user",
                        "content": text
                    }
                ]
            )
            return graph
        except Exception as e:
            print(f"Error during extraction: {e}")
            return None
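
    As a quick sanity check, you can run the extractor over one of the sample documents ingested earlier. This assumes the docs list from the ingestion snippet is still in scope; the exact entities and relationship labels will vary from run to run, since they come from the LLM.

    python
    # Smoke test: extract a graph from the acquisition article
    sample_text = docs[1]["content"]  # "InnovateCorp ... acquisition of GraphiQL ..."
    sample_graph = extract_graph_from_text(sample_text)
    
    if sample_graph:
        for entity in sample_graph.entities:
            print(f"ENTITY: {entity.name} ({entity.type})")
        for rel in sample_graph.relationships:
            print(f"REL: {rel.source} -[{rel.type}]-> {rel.target}")
    # Roughly expected: InnovateCorp and GraphiQL as ORGANIZATION entities,
    # connected by something like GraphiQL -[ACQUIRED_BY]-> InnovateCorp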

    2b. Building the Graph with NetworkX

    With our extraction function ready, we can iterate through the candidate documents, extract their structures, and aggregate them into a single networkx graph. We use a MultiDiGraph to allow for multiple types of relationships between the same two entities.

    python
    import networkx as nx
    from tqdm import tqdm
    from typing import Tuple
    
    def build_knowledge_graph(documents: List[Dict]) -> Tuple[nx.MultiDiGraph, Dict[str, List[str]]]:
        """
        Builds a knowledge graph from a list of documents.
        Returns the aggregated graph and a mapping from each document ID to the entities it contains.
        """
        G = nx.MultiDiGraph()
        doc_to_entities_map = {}
    
        for doc in tqdm(documents, desc="Building Knowledge Graph"):
            doc_id = doc['id']
            # In production, consider batching these calls
            extracted_data = extract_graph_from_text(doc['content'])
            
            if extracted_data:
                doc_entities = []
                for entity in extracted_data.entities:
                    G.add_node(entity.name, type=entity.type)
                    doc_entities.append(entity.name)
                
                doc_to_entities_map[doc_id] = doc_entities
    
                for rel in extracted_data.relationships:
                    # Ensure both source and target nodes exist before adding an edge
                    if G.has_node(rel.source) and G.has_node(rel.target):
                        G.add_edge(rel.source, rel.target, key=rel.type, label=rel.type)
        
        return G, doc_to_entities_map
    
    # Build the graph from our candidate documents
    knowledge_graph, doc_map = build_knowledge_graph(candidate_docs)
    
    # You can visualize the graph to verify (requires matplotlib)
    # import matplotlib.pyplot as plt
    # pos = nx.spring_layout(knowledge_graph, k=0.9, iterations=50)
    # nx.draw(knowledge_graph, pos, with_labels=True, node_size=1500, node_color='skyblue', font_size=10)
    # edge_labels = nx.get_edge_attributes(knowledge_graph, 'label')
    # nx.draw_networkx_edge_labels(knowledge_graph, pos, edge_labels=edge_labels)
    # plt.show()

    This process creates a graph where nodes are entities and edges are relationships. Crucially, we also maintain a doc_map that links each document ID to the entities it contains.
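
    To make these structures concrete, a quick inspection might look like this (entity names and counts will depend on what the extraction model returns for your corpus):

    python
    # Inspect the aggregated graph and the document-to-entity mapping
    print(f"Graph contains {knowledge_graph.number_of_nodes()} entities "
          f"and {knowledge_graph.number_of_edges()} relationships")
    
    # Each candidate document ID maps to the entity names extracted from it
    for doc_id, entities in list(doc_map.items())[:3]:
        print(doc_id, "->", entities)
    
    # Edges of the MultiDiGraph carry the relationship type as their key
    for source, target, rel_type in knowledge_graph.edges(keys=True):
        print(f"{source} -[{rel_type}]-> {target}")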

    2c. Reranking with Graph Centrality

    The final step in this phase is to use the graph's structure to score our documents. A simple yet powerful approach is to use a centrality algorithm like PageRank. The idea is that entities that are highly connected and central to the graph are more important. Documents containing these central entities should be ranked higher.

    We can enhance PageRank by providing a personalization vector. This biases the random walk to prefer nodes that are directly related to the entities found in the original query.

    python
    import numpy as np
    
    def rerank_docs_with_graph(documents: List[Dict], graph: nx.MultiDiGraph, doc_map: Dict, query: str) -> List[Dict]:
        """
        Reranks documents based on the PageRank centrality of the entities they contain.
        """
        if not graph.nodes or not doc_map:
            return documents # Fallback to original order if graph is empty
    
        # 1. Identify key entities in the original query (can use a simpler LLM call or regex)
        # For this example, we'll manually identify them for clarity
        query_entities = ['Chroma', 'company', 'startup', 'creator'] # Simplified
        
        # 2. Build personalization vector for PageRank
        personalization = {}
        for node in graph.nodes:
            # Boost nodes that are similar to query entities
            # A more advanced approach would use embeddings for similarity
            if any(qe.lower() in node.lower() for qe in query_entities):
                personalization[node] = 1.0
            else:
                personalization[node] = 0.1 # Small base value for all nodes
    
        # 3. Calculate PageRank
        try:
            pagerank_scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
        except nx.PowerIterationFailedConvergence:
            # Fallback if the power iteration fails to converge
            pagerank_scores = {node: 1.0 / len(graph.nodes) for node in graph.nodes}
    
        # 4. Score documents based on the max PageRank of their entities
        doc_scores = {}
        for doc_id, entities in doc_map.items():
            if not entities:
                doc_scores[doc_id] = 0
                continue
            
            max_entity_score = 0
            for entity in entities:
                if entity in pagerank_scores:
                    max_entity_score = max(max_entity_score, pagerank_scores[entity])
            
            doc_scores[doc_id] = max_entity_score
    
        # 5. Sort documents by their new scores
        doc_dict = {doc['id']: doc for doc in documents}
        sorted_doc_ids = sorted(doc_scores.keys(), key=lambda k: doc_scores[k], reverse=True)
        
        reranked_docs = [doc_dict[doc_id] for doc_id in sorted_doc_ids if doc_id in doc_dict]
        
        # Append documents that never made it into the graph (e.g., failed extraction)
        # so they are demoted rather than silently dropped
        scored_ids = set(doc_scores.keys())
        reranked_docs.extend(doc for doc in documents if doc['id'] not in scored_ids)
        
        return reranked_docs
    
    # Perform the reranking
    reranked_documents = rerank_docs_with_graph(candidate_docs, knowledge_graph, doc_map, multi_hop_query)
    
    print("Top 5 Reranked Documents:")
    for i, doc in enumerate(reranked_documents[:5]):
        print(f"{i+1}. {doc['title']} (Source: {doc['source']})")

    After this step, the documents containing InnovateCorp, GraphiQL, and John Doe will be pushed to the top, as they form the central backbone of the constructed graph, while distractor documents will be pushed down.


    Phase 3: Synthesis with the Reranked Context

    Now that we have a highly relevant and logically ordered set of documents, the final step is straightforward but requires careful prompt engineering.

    Instead of a messy, unordered context blob, we can present the top k (e.g., k=5) reranked documents to the LLM. It's crucial to instruct the model to synthesize the answer based only on the provided context and to cite its sources.

    python
    def generate_final_answer(query: str, context_docs: List[Dict], top_k: int = 5) -> str:
        """
        Generates the final answer using the reranked context.
        """
        context = ""
        for i, doc in enumerate(context_docs[:top_k]):
            context += f"Source [{i+1}]: {doc['source']}\nTitle: {doc['title']}\nContent: {doc['content']}\n---\n"
        
        prompt = f"""
        You are a question-answering AI. Answer the following user query based ONLY on the provided sources. 
        Synthesize the information from the multiple sources to construct a coherent answer. 
        If the answer cannot be found in the sources, state that clearly.
        Cite the sources used for each part of your answer, for example [1], [2], etc.
    
        USER QUERY:
        {query}
    
        SOURCES:
        {context}
    
        ANSWER:
        """
    
        # Use an OpenAI client for generation; the Weaviate `client` defined earlier
        # has no chat completions API.
        llm_client = OpenAI(api_key=os.getenv("OPENAI_APIKEY"))
        response = llm_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a helpful question-answering assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0
        )
        
        return response.choices[0].message.content
    
    # Generate the answer
    final_answer = generate_final_answer(multi_hop_query, reranked_documents)
    
    print("\nFinal Answer:\n", final_answer)

    Expected Output:

    The creator of the 'Chroma' data visualization library is John Doe [4]. He founded a startup named 'GraphiQL' [1]. This startup was acquired by InnovateCorp [2]. InnovateCorp's market capitalization later reached $150 billion [3].

    This answer correctly synthesizes information from multiple documents, demonstrating the power of our multi-stage retrieval process.


    Production Considerations & Edge Cases

    Deploying this system requires addressing several practical challenges:

  • Latency: The on-the-fly graph extraction is the primary bottleneck; each LLM call adds latency. Mitigations:
    • Batching: Send multiple documents to the LLM in a single extraction request.
    • Smaller Models: Use a fine-tuned, smaller, and faster model (e.g., a 7B-parameter model) dedicated to the extraction task. This can be significantly cheaper and faster than a GPT-4 call.
    • Asynchronous Processing: If the application can tolerate a slight delay, perform the graph extraction and reranking asynchronously while showing the user a preliminary result from the initial hybrid search.

  • Cost: Multiple LLM calls for extraction can be expensive. Mitigations:
    • Aggressive Caching: Cache the extracted graph structure for each document and reuse it whenever that document is retrieved again, instead of re-running extraction (a minimal caching sketch follows this list).
    • Selective Extraction: Instead of running extraction on all 50 candidate docs, run it only on the top 15-20, which are most likely to be relevant.

  • Disconnected Graphs: If the retrieved documents share no common entities, the graph becomes a set of disconnected components. Handling: our rerank_docs_with_graph function wraps nx.pagerank in a try...except block to catch nx.PowerIterationFailedConvergence; the fallback is to return the original hybrid search ranking or use a simpler scoring metric, such as the number of entities a document contains.

  • Entity Ambiguity & Linking: The system might extract "Apple" (the company) and "apple" (the fruit) as the same node. Solution: for higher accuracy, implement an entity linking step that maps each extracted entity to a canonical entry in a knowledge base such as Wikidata. This normalizes variants (InnovateCorp, Innovate Corp., InnovateCorp Inc.) to a single identifier, making the graph more robust.
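
    As a concrete illustration of the caching mitigation above, here is a minimal sketch that keys the extracted graph on a hash of the document's content. It reuses the extract_graph_from_text function from Phase 2; a production deployment would typically back the cache with Redis or a database rather than an in-process dict.

    python
    import hashlib
    
    # In-process cache: content hash -> extracted graph (swap for Redis/a DB in production)
    _extraction_cache: Dict[str, Optional[ExtractedGraph]] = {}
    
    def extract_graph_cached(text: str) -> Optional[ExtractedGraph]:
        """Extracts a graph from text, reusing a cached result for unchanged content."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in _extraction_cache:
            _extraction_cache[key] = extract_graph_from_text(text)
        return _extraction_cache[key]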

    Benchmarking the Improvement

    To quantify the value of this approach, a proper evaluation is necessary.

  • Dataset: Create a 'golden dataset' of 20-50 multi-hop questions where the required documents are known to exist in your corpus.

  • Metrics:
    • Context Recall@K: For a given question, is all the necessary information present in the top K documents sent to the LLM? Compare naive RAG vs. graph-reranked RAG (a minimal computation sketch follows this list).
    • Answer Faithfulness: Use an LLM-as-a-judge to evaluate whether the generated answer is fully supported by the provided context. The graph-reranked context should lead to higher faithfulness scores.
    • End-to-End Answer Accuracy: Human evaluation of the final answers.
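
    As an illustration, Context Recall@K reduces to a few lines once you have a golden dataset mapping each question to the document IDs it requires. This is a sketch under the assumption that each retrieval pipeline is exposed as a callable returning ranked documents with an 'id' field; the graph_reranked_search name below is hypothetical.

    python
    def context_recall_at_k(golden_dataset: List[Dict], retrieve_fn, k: int = 5) -> float:
        """
        golden_dataset: list of {"question": str, "required_doc_ids": set} examples.
        retrieve_fn: callable taking a question and returning a ranked list of docs.
        Returns the fraction of questions whose required docs all appear in the top k.
        """
        if not golden_dataset:
            return 0.0
        hits = 0
        for example in golden_dataset:
            top_k_ids = {doc["id"] for doc in retrieve_fn(example["question"])[:k]}
            if example["required_doc_ids"].issubset(top_k_ids):
                hits += 1
        return hits / len(golden_dataset)
    
    # Hypothetical comparison of the two pipelines:
    # recall_naive = context_recall_at_k(golden, hybrid_search, k=5)
    # recall_graph = context_recall_at_k(golden, graph_reranked_search, k=5)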

    Hypothetical Benchmark Results:

    | Method | Context Recall@5 | Answer Faithfulness | Latency (p95) |
    | --- | --- | --- | --- |
    | Naive Vector Search RAG | 35% | 60% | 1.5s |
    | Hybrid Search RAG | 60% | 75% | 1.8s |
    | Hybrid + Graph Rerank | 95% | 98% | 4.5s |

    This benchmark illustrates the classic trade-off: our advanced method significantly improves quality at the cost of increased latency, highlighting the importance of the optimization strategies discussed above.

    Conclusion

    Simple semantic search is insufficient for the next generation of AI applications that require deep reasoning. By treating retrieval not as a single step but as a multi-stage pipeline, we can overcome the limitations of naive RAG. The Hybrid Search + Dynamic Graph Reranking pattern provides a powerful and flexible way to model the relationships within retrieved information, ensuring that the context provided to the LLM is not just relevant but also relationally coherent. While it introduces complexity and latency, the dramatic improvement in answering multi-hop, compositional queries makes it an essential technique for any senior engineer building sophisticated knowledge-based systems.
