Production RAG: Semantic Caching for Sub-50ms LLM Responses

Goh Ling Yong

The Unavoidable Latency Tax in Production RAG

In any production-grade Retrieval-Augmented Generation (RAG) system, the end-to-end latency is a composite of multiple, often expensive, operations. A typical request flow introduces latency at every step:

  • Query Embedding: The user's input query must be converted into a vector embedding. This involves a network call to an embedding model API (like OpenAI's text-embedding-3-small) or inference on a self-hosted model. Latency: 50-200ms.
  • Vector Search: The query vector is used to perform an Approximate Nearest Neighbor (ANN) search against a large index of document chunk embeddings in a vector database (e.g., Pinecone, Weaviate, Milvus). This includes network latency to the DB and the computational cost of the search itself. Latency: 50-300ms.
  • Context Augmentation & Prompt Engineering: The retrieved document chunks are formatted and injected into a prompt template alongside the original query.
  • LLM Generation: The final, augmented prompt is sent to a Large Language Model (LLM) for the final answer generation. This is almost always the most significant contributor to latency. Latency: 1000-5000ms+, depending on the model and output token length.

Summing these, a 'fast' RAG response often lands in the 2-3 second range, with slower responses easily exceeding 5-10 seconds. For any user-facing, interactive application, this is a non-starter. The solution is not merely optimizing individual components but architecting a system that can bypass the entire pipeline for a significant portion of incoming queries. This is where a sophisticated caching layer becomes a system requirement, not an enhancement.
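
A simple way to see where your own pipeline spends this time is to instrument each stage. Below is a minimal timing sketch; embed_query, vector_search, build_prompt, and call_llm are hypothetical stand-ins for your actual embedding, retrieval, and generation calls.

python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

def answer_with_timings(query: str):
    timings = {}
    with timed("embed", timings):
        query_vector = embed_query(query)          # hypothetical embedding call
    with timed("vector_search", timings):
        chunks = vector_search(query_vector, k=5)  # hypothetical ANN search
    with timed("llm", timings):
        prompt = build_prompt(query, chunks)       # hypothetical prompt assembly
        answer = call_llm(prompt)                  # hypothetical LLM call
    # e.g. {'embed': 80.1, 'vector_search': 120.4, 'llm': 2300.7}
    return answer, timings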


    Architecting a Multi-Layered Semantic Cache

    A naive cache might use the raw query string as a key in a simple key-value store like Redis. This fails in practice because users ask the same semantic question in myriad ways ("how do I reset my password?" vs. "I forgot my password, what do I do?"). Our architecture must be resilient to lexical variations and understand semantic intent. We will construct a two-layer cache that sits in front of the main RAG pipeline.

    Request Flow with Caching:

    mermaid
    graph TD
        A[User Query] --> B{Layer 1: Lexical Cache Check};
        B -- Hit --> C[Return Cached Response];
        B -- Miss --> D{Layer 2: Semantic Cache Check};
        D -- Hit --> C;
        D -- Miss --> E[Execute Full RAG Pipeline];
        E --> F[LLM Generates Response];
        F --> G{Cache Write Logic};
        G --> H[Return Response to User];
        G --> I[Populate L1 & L2 Caches];

    Layer 1: The Lexical Cache (Exact Match)

    This is our first line of defense. It's a standard key-value cache that maps a canonical representation of a query to a previously generated answer.

    * Store: Redis is an ideal choice due to its low latency and TTL capabilities.

    * Key Generation: The key should not be the raw query string. Instead, we create a normalized, canonical key. A robust function would involve lowercasing, removing punctuation, and sorting words to handle minor rephrasing.

    Implementation:

    python
    import redis
    import hashlib
    import json
    
    # Connect to Redis
    redis_client = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
    
    def get_lexical_cache_key(query: str) -> str:
        """Creates a normalized, order-agnostic key for the query."""
        normalized_query = ''.join(e for e in query.lower() if e.isalnum() or e.isspace())
        # Sorting words makes it resilient to word order changes
        sorted_words = sorted(normalized_query.split())
        canonical_query = " ".join(sorted_words)
        # Use a hash to ensure a fixed-length key
        return f"lexical_cache:{hashlib.sha256(canonical_query.encode()).hexdigest()}"
    
    def get_from_lexical_cache(query: str):
        """Attempt to retrieve a response from the lexical cache."""
        key = get_lexical_cache_key(query)
        cached_result = redis_client.get(key)
        if cached_result:
            print("--- Lexical Cache HIT ---")
            return json.loads(cached_result)
        print("--- Lexical Cache MISS ---")
        return None
    
    def set_in_lexical_cache(query: str, response: dict, ttl_seconds: int = 3600):
        """Store a response in the lexical cache with a TTL."""
        key = get_lexical_cache_key(query)
        redis_client.setex(key, ttl_seconds, json.dumps(response))
        print(f"--- Lexical Cache SET for key: {key} ---")
    
    # Example Usage
    query1 = "How can I reset my account password?"
    query2 = "reset my password for my account how?"
    
    # First request (miss)
    response = get_from_lexical_cache(query1)
    
    # Assume we run the full RAG pipeline and get a response
    if not response:
        rag_response = {"answer": "To reset your password, go to the settings page and click 'Reset Password'.", "source_docs": ["doc1.txt"]}
        set_in_lexical_cache(query1, rag_response)
    
    # Second, reworded request with the same word set (now a hit)
    response = get_from_lexical_cache(query2)
    print(response)

    Edge Case: This normalization is lossy. "apple stock price" and "stock price apple" normalize to the same key, which is desired. However, more complex queries can lose nuance. This layer is optimized for speed and high-precision hits on very common, simple queries. Its limitations are why we need Layer 2.
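
    To see the nuance-loss risk concretely, here is a hypothetical illustration using the get_lexical_cache_key function above: two directional queries collide after lowercasing and word sorting, even though they need different answers.

    python
    # Hypothetical illustration: word sorting discards word order, so directional
    # queries collapse onto a single lexical cache key.
    key_a = get_lexical_cache_key("flights from NYC to LA")
    key_b = get_lexical_cache_key("flights from LA to NYC")
    print(key_a == key_b)  # True -- but these two queries need different answers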

    Layer 2: The Semantic Cache

    This is the core of our high-performance caching strategy. It doesn't match strings; it matches semantic meaning by comparing vector embeddings.

    Architecture:

  • Cache Vector Index: A separate, dedicated vector index (e.g., in-memory FAISS, or a dedicated Pinecone/Weaviate index) that stores only the embeddings of queries that have been answered.
  • Answer Store: A key-value store (we can reuse Redis) that maps a query's unique ID to its full generated answer.
  • Read Path:
      a. Embed the incoming user query.
      b. Search the cache vector index for semantically similar historical queries.
      c. If a historical query is found above a certain similarity threshold (e.g., cosine similarity > 0.98), retrieve its full answer from the Answer Store and return it.
  • Write Path:
      a. After a full RAG pipeline execution (a cache miss), generate a unique ID for the new query-answer pair.
      b. Store the full answer in the Answer Store with the new ID.
      c. Add the new query's embedding and its ID to the Cache Vector Index.

    Choosing the Right Tools:

    * Embedding Model: The cache embedding model can, and should, be different from your main RAG pipeline's model. For the cache, speed is paramount. A smaller, faster model like sentence-transformers/all-MiniLM-L6-v2 is an excellent choice: it runs locally, has low latency, and is highly effective for semantic similarity tasks.

    * Vector Index: For ultimate speed, an in-memory index using faiss-cpu is a great option if the number of cached queries is manageable (e.g., < 1 million). For larger scales or distributed systems, a managed vector DB is more practical.
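
    Both choices are easy to sanity-check before committing. The sketch below times a single-query encode with all-MiniLM-L6-v2 and estimates the raw memory footprint of a flat in-memory index holding one million cached queries; the exact numbers will vary with your hardware.

    python
    import time
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-MiniLM-L6-v2')
    dim = model.get_sentence_embedding_dimension()  # 384 for this model

    # Single-query encode latency: the dominant cost of a semantic-cache lookup
    model.encode(["warm-up query"])  # first call includes one-time initialization
    start = time.perf_counter()
    model.encode(["How do I reset my password?"], convert_to_numpy=True, normalize_embeddings=True)
    print(f"Encode latency: {(time.perf_counter() - start) * 1000:.1f} ms")

    # Raw vector memory for 1M cached queries in a flat float32 index (~1.5 GB)
    n_cached = 1_000_000
    print(f"~{n_cached * dim * np.dtype(np.float32).itemsize / 1e9:.2f} GB of vectors")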

    Implementation with FAISS and SentenceTransformers:

    python
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer
    import uuid
    import json
    import redis
    
    # --- Initialization ---
    
    # 1. Load a fast embedding model
    print("Loading embedding model...")
    cache_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    embedding_dim = cache_embedding_model.get_sentence_embedding_dimension()
    
    # 2. Initialize FAISS index
    # Using IndexFlatL2 is simple, but for production, IndexHNSWFlat is faster for search.
    # We wrap it in IndexIDMap so we can assign our own integer IDs, which we then map to UUIDs.
    index = faiss.IndexFlatL2(embedding_dim)
    index_map = faiss.IndexIDMap(index)
    
    # 3. Connect to Redis for the Answer Store
    redis_client = redis.Redis(host='localhost', port=6379, db=1, decode_responses=True)
    
    # --- Global State (in a real app, this would be managed properly) ---
    # In-memory mapping from our integer IDs to string UUIDs for this example
    # A production system might use a persistent store or a different FAISS index type
    faiss_id_counter = 0
    id_to_uuid_map = {}
    
    # --- Core Functions ---
    
    def search_semantic_cache(query: str, threshold: float = 0.98):
        """Search for a semantically similar query in the cache."""
        global index_map, id_to_uuid_map
    
        if index_map.ntotal == 0:
            print("--- Semantic Cache is empty. MISS ---")
            return None
    
        query_embedding = cache_embedding_model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
        # FAISS's IndexFlatL2 returns *squared* L2 distances. For unit-normalized vectors,
        # squared_L2_distance = 2 * (1 - cosine_similarity), so we convert the cosine
        # similarity threshold into an equivalent squared-distance threshold.
        distance_threshold = 2 * (1 - threshold)

        # Search for the 1 nearest neighbor
        distances, ids = index_map.search(query_embedding, k=1)

        if ids[0][0] != -1 and distances[0][0] < distance_threshold:
            print(f"--- Semantic Cache HIT with distance {distances[0][0]} ---")
            matched_faiss_id = ids[0][0]
            matched_uuid = id_to_uuid_map[matched_faiss_id]
            cached_response = redis_client.get(f"semantic_cache:{matched_uuid}")
            return json.loads(cached_response) if cached_response else None
        
        print("--- Semantic Cache MISS ---")
        return None
    
    def add_to_semantic_cache(query: str, response: dict):
        """Add a new query-answer pair to the cache."""
        global faiss_id_counter, index_map, id_to_uuid_map
    
        # Normalize here too, so stored vectors match the normalized query vectors used at search time
        query_embedding = cache_embedding_model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
        new_uuid = str(uuid.uuid4())
        new_faiss_id = faiss_id_counter

        # Add to FAISS index (FAISS expects 64-bit integer IDs)
        index_map.add_with_ids(query_embedding, np.array([new_faiss_id], dtype=np.int64))
    
        # Update mappings
        id_to_uuid_map[new_faiss_id] = new_uuid
        faiss_id_counter += 1
    
        # Add to Redis answer store
        redis_client.set(f"semantic_cache:{new_uuid}", json.dumps(response))
        print(f"--- Semantic Cache SET for UUID: {new_uuid} ---")
    
    # --- Example Flow ---
    
    query1 = "What is the process for submitting an expense report?"
    query2 = "How do I file my expenses?"
    query3 = "What's the weather like today?"
    
    # 1. First query - miss
    response = search_semantic_cache(query1)
    if not response:
        # Simulate full RAG pipeline
        rag_response = {"answer": "To submit an expense report, log into the portal, navigate to 'My Expenses', and click 'New Report'.", "source_docs": ["hr_manual_p42.pdf"]}
        add_to_semantic_cache(query1, rag_response)
    
    # 2. Second, semantically similar query - a hit, assuming the pair clears the similarity
    #    threshold (real-world paraphrases often need a threshold looser than 0.98)
    response = search_semantic_cache(query2)
    if response:
        print("Cached response:", response['answer'])
    
    # 3. Third, unrelated query - miss
    response = search_semantic_cache(query3)
    

    Production Implementation: Critical Details and Edge Cases

    Moving this architecture from a script to a robust production service requires addressing several critical factors.

    1. Cache Invalidation Strategy

    What happens when the source documents for your RAG system are updated? A cached answer might become stale or incorrect. This is the hardest problem in caching.

    * TTL-based Eviction: The simplest approach. Assign a TTL (e.g., 24 hours) to all cache entries. This guarantees eventual consistency but can serve stale data within the TTL window. This is acceptable for data that changes infrequently.

    * Event-Driven Invalidation (Advanced): For systems requiring higher consistency, you need an active invalidation mechanism. When a source document is updated, deleted, or added, an event is published (e.g., to Kafka, SQS, or RabbitMQ), and a cache invalidation service subscribes to these events. The challenge is knowing which cached queries to invalidate: there is no natural mapping from a changed document back to the cached answers that were generated from it.

    A Production Pattern: When caching a response, store the IDs of the source document chunks used to generate it alongside the answer.

    json
        // In Redis Answer Store
        "semantic_cache:<uuid>": {
            "answer": "...",
            "source_chunks": ["doc_abc_chunk_3", "doc_abc_chunk_5"]
        }

    Your invalidation service, upon receiving an event for doc_abc, can then scan the cache (or a secondary index) for all entries that relied on chunks from that document and purge them.
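
    One cheap way to support that lookup is a reverse index in Redis: whenever you cache an answer, add the cache entry's UUID to a set keyed by each source document it used. The sketch below is a minimal version of that idea; it assumes the source_chunks convention shown above (chunk IDs prefixed with their document ID) and the db=1 Redis Answer Store from the semantic cache code, and it applies a TTL as a safety net.

    python
    import json
    import redis

    redis_client = redis.Redis(host='localhost', port=6379, db=1, decode_responses=True)

    def cache_answer_with_provenance(entry_uuid: str, response: dict, ttl_seconds: int = 86400):
        """Store the answer (with a TTL) and index it under each source document it used."""
        redis_client.set(f"semantic_cache:{entry_uuid}", json.dumps(response), ex=ttl_seconds)
        for chunk_id in response.get("source_chunks", []):
            doc_id = chunk_id.rsplit("_chunk_", 1)[0]  # e.g. "doc_abc_chunk_3" -> "doc_abc"
            redis_client.sadd(f"doc_to_cache:{doc_id}", entry_uuid)

    def invalidate_document(doc_id: str):
        """Called by the invalidation service when a source document changes."""
        entry_uuids = redis_client.smembers(f"doc_to_cache:{doc_id}")
        if entry_uuids:
            redis_client.delete(*[f"semantic_cache:{u}" for u in entry_uuids])
        redis_client.delete(f"doc_to_cache:{doc_id}")

    The purged entries' embeddings can be left in the cache vector index: a subsequent near-match will fail the Redis lookup, fall through to the full pipeline, and re-cache a fresh answer.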

    2. Performance Tuning the ANN Index

    For the semantic cache, search latency is everything. While IndexFlatL2 is exact, it's an O(n) linear scan and won't scale. For production, you must use an ANN index such as HNSW (Hierarchical Navigable Small World).

    * HNSW with FAISS: faiss.IndexHNSWFlat(embedding_dim, M)

    * M: The number of neighbors per node in the graph. Higher M increases memory usage and index build time but can improve accuracy. A value of 32 or 64 is a good starting point.

    * efConstruction: A build-time parameter. Higher values produce a better-quality graph at the cost of slower inserts. Since this cost is paid when vectors are added rather than at query time, it can be set relatively high (e.g., 100).

    * efSearch: A search-time parameter and the most critical tuning knob. It controls the size of the candidate list explored during the search. A higher efSearch increases accuracy at the cost of higher latency. You can tune this per-query to balance speed and quality.

    Example: Tuning efSearch

    python
    # In a production FAISS setup
    hnsw_index = faiss.IndexHNSWFlat(embedding_dim, 32)
    # ... add vectors to hnsw_index
    
    # At search time:
    # For a high-priority query where accuracy is key
    hnsw_index.hnsw.efSearch = 128 
    distances, ids = hnsw_index.search(query_embedding, k=1)
    
    # For a low-priority, high-throughput endpoint
    hnsw_index.hnsw.efSearch = 16
    distances, ids = hnsw_index.search(query_embedding, k=1)

    3. The Cold Start Problem

    A freshly deployed cache is empty and provides no benefit. Pre-populating, or 'warming', the cache is essential.

    * Log and Backfill: Log all production queries that result in a cache miss. Periodically, run an offline batch job to execute the full RAG pipeline for these queries and populate the cache. This is the most common and effective strategy; a minimal warm-up sketch follows the question-synthesis example below.

    * Synthesize Questions: Use an LLM to pre-generate potential questions from your source document chunks. This can build a baseline cache before any real user traffic arrives. The quality of these questions can be variable, but it's better than an empty cache.

    Example using OpenAI for question synthesis:

    python
        import json
        import openai
        
        def synthesize_questions_from_chunk(document_chunk: str):
            response = openai.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant. Given the following text, generate 3 distinct questions that this text could answer. Respond with a JSON list of strings."},
                    {"role": "user", "content": document_chunk}
                ]
            )
            try:
                questions = json.loads(response.choices[0].message.content)
                return questions
            except (json.JSONDecodeError, TypeError):
                # The model's output wasn't valid JSON; skip this chunk
                return []
    
        # chunk = "The SOC 2 Type 2 report is an audit of a service organization's systems..."
        # generated_questions = synthesize_questions_from_chunk(chunk)
        # # Now run these questions through your RAG pipeline to populate the cache.
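
    For the log-and-backfill strategy, the warm-up job can be equally simple. The sketch below assumes the request path pushes missed queries onto a Redis list named rag:cache_misses (a hypothetical name), and reuses the cache helpers defined earlier plus a full_rag_pipeline function like the one simulated in the next section.

    python
    import redis

    miss_log = redis.Redis(host='localhost', port=6379, db=2, decode_responses=True)

    def log_cache_miss(query: str):
        """Called from the request path whenever both cache layers miss."""
        miss_log.rpush("rag:cache_misses", query)

    def warm_cache_from_miss_log(batch_size: int = 100):
        """Offline job: replay logged misses through the full pipeline and cache the results."""
        for _ in range(batch_size):
            query = miss_log.lpop("rag:cache_misses")
            if query is None:
                break
            # Skip queries that live traffic has answered and cached in the meantime
            if get_from_lexical_cache(query) or search_semantic_cache(query):
                continue
            response = full_rag_pipeline(query)
            set_in_lexical_cache(query, response)
            add_to_semantic_cache(query, response)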

    Full System Integration & Performance Benchmarking

    Let's assemble the full, multi-layered request flow and benchmark its performance.

    python
    # --- This is a simulation of the full end-to-end flow ---
    import time
    
    # Assume all previous functions (get_lexical_cache_key, etc.) are defined
    
    def full_rag_pipeline(query: str) -> dict:
        """Simulates the slow, expensive RAG pipeline."""
        print("--- EXECUTING FULL RAG PIPELINE (CACHE MISS) ---")
        time.sleep(2.5) # Simulate embedding, vector search, LLM call
        return {"answer": f"This is a freshly generated answer for: '{query}'", "source_docs": ["doc_xyz.pdf"]}
    
    def handle_query_request(query: str):
        """The main entry point for a user query."""
        start_time = time.time()
    
        # 1. Check Layer 1: Lexical Cache
        response = get_from_lexical_cache(query)
        if response:
            end_time = time.time()
            print(f"Total response time: {(end_time - start_time) * 1000:.2f} ms")
            return response
    
        # 2. Check Layer 2: Semantic Cache
        response = search_semantic_cache(query, threshold=0.98)
        if response:
            end_time = time.time()
            print(f"Total response time: {(end_time - start_time) * 1000:.2f} ms")
            return response
    
        # 3. Cache Miss: Execute full pipeline
        response = full_rag_pipeline(query)
    
        # 4. Populate caches for future requests
        set_in_lexical_cache(query, response)
        add_to_semantic_cache(query, response)
    
        end_time = time.time()
        print(f"Total response time: {(end_time - start_time) * 1000:.2f} ms")
        return response
    
    # --- Benchmark Run ---
    
    print("\n--- Run 1: Cold query ---")
    query_a = "How do I configure the SSO integration?"
    handle_query_request(query_a)
    
    print("\n--- Run 2: Identical query (Lexical Hit) ---")
    handle_query_request(query_a)
    
    print("\n--- Run 3: Semantically similar query (Semantic Hit) ---")
    query_b = "What are the steps to set up single sign-on?"
    handle_query_request(query_b)
    
    print("\n--- Run 4: Another cold query ---")
    query_c = "What is our data retention policy?"
    handle_query_request(query_c)
    

    Expected Benchmark Results:

    | Path Taken | Expected Latency | Notes |
    | --- | --- | --- |
    | Run 1: Full RAG Pipeline | ~2500+ ms | The baseline 'slow' path. Incurs all network and compute costs. |
    | Run 2: Lexical Cache Hit | < 10 ms | A single Redis GET operation. Extremely fast. |
    | Run 3: Semantic Cache Hit | < 50 ms | Includes local embedding (~20ms) and local FAISS search (~5-10ms). |
    | Run 4: Full RAG Pipeline | ~2500+ ms | Another cache miss, populating the cache for a new topic. |

    These results demonstrate the transformative impact of this architecture. Even a 5% hit rate on the lexical cache and a 20% hit rate on the semantic cache can dramatically reduce the average latency and computational cost of your entire system.
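
    To make that concrete, here is a back-of-the-envelope calculation using the latencies from the table above and those illustrative hit rates; the rates themselves are assumptions you should replace with your own traffic numbers.

    python
    # Expected average latency for a 5% lexical / 20% semantic hit-rate mix
    lexical_hit, semantic_hit = 0.05, 0.20
    miss = 1 - lexical_hit - semantic_hit
    avg_ms = lexical_hit * 10 + semantic_hit * 50 + miss * 2500
    print(f"Expected average latency: ~{avg_ms:.0f} ms (vs ~2500 ms uncached)")  # ~1886 ms
    print(f"LLM calls avoided: {100 * (lexical_hit + semantic_hit):.0f}% of traffic")  # 25%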

    Conclusion: Caching as a Core Architectural Pillar

    For senior engineers building robust, scalable, and cost-effective RAG systems, caching cannot be an afterthought. A multi-layered approach, combining the raw speed of a lexical key-value store with the intelligence of a vector-based semantic cache, is a necessity for providing an acceptable user experience.

    By carefully selecting fast embedding models for the cache layer, tuning ANN index parameters like efSearch, and implementing a robust, event-driven invalidation strategy, you can build a system that delivers sub-50ms responses for a significant portion of user queries. This not only improves user satisfaction but also drastically reduces the load on expensive LLM endpoints, leading to significant cost savings at scale. The patterns discussed here provide a production-ready blueprint for moving beyond naive RAG implementations to a truly performant and efficient AI-powered application.
