Stateful Inference with vLLM for Conversational AI at Scale

18 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Unseen Scalability Crisis in Conversational AI

In the world of large language models, generating a single, isolated response is a solved problem. However, building a responsive, scalable conversational agent that remembers context across multiple turns introduces a significant performance bottleneck that is often overlooked in initial development. The standard approach—prepending the entire chat history to each new user query—is computationally catastrophic at scale.

Consider a typical interaction:

  • User: What is vLLM?
  • Model: vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
  • User: How does it achieve that?

    To process the third turn (the second user message), a stateless service constructs a prompt like this:

    text
    <|user|>
    What is vLLM?
    <|assistant|>
    vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
    <|user|>
    How does it achieve that?
    <|assistant|>

    The entire concatenated string is tokenized and fed to the model. The model's attention mechanism must then re-process every single token from the beginning of the conversation. This initial processing phase, known as prefill, is where the Key-Value (KV) cache for the attention layers is computed. As the conversation grows, the prefill phase dominates the generation time, leading to a linear increase in Time-to-First-Token (TTFT). For a conversation with 10 turns and an average of 150 tokens per turn, you are re-processing thousands of tokens just to generate the next response. This is prohibitively slow and expensive.
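
    To make the arithmetic concrete, here is a quick back-of-the-envelope sketch (plain Python, no vLLM required) of how many tokens a stateless service re-prefills over that 10-turn, ~150-tokens-per-turn conversation:

    python
    # Rough cost model for stateless serving: each turn re-prefills the whole history.
    TOKENS_PER_TURN = 150  # average user message + assistant reply, as in the example above
    TURNS = 10

    history = 0
    total_stateless_prefill = 0
    for turn in range(1, TURNS + 1):
        history += TOKENS_PER_TURN              # the history grows every turn
        total_stateless_prefill += history      # stateless: the entire history is prefilled again
        print(f"Turn {turn}: prefill {history} tokens (running total {total_stateless_prefill})")

    # With prefix caching, only the ~150 new tokens per turn need prefill.
    print(f"Stateless total: {total_stateless_prefill} vs. stateful total: {TURNS * TOKENS_PER_TURN}")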

    This article presents a production-grade architecture for a stateful inference service using vLLM. We will leverage vLLM's highly efficient KV cache management, specifically its prefix caching, to eliminate redundant computation and achieve near-constant TTFT, regardless of the conversation's length. We will not be explaining the basics of vLLM; we assume you understand its purpose. Instead, we'll focus on the architectural patterns, implementation details, and critical edge cases you'll face in a real-world deployment.

    The Architectural Linchpin: vLLM's Prefix Caching

    While vLLM is renowned for PagedAttention, which solves KV cache memory fragmentation, the feature that directly enables our stateful architecture is its automatic prefix caching. When you send a sequence of token IDs to vLLM, it computes a hash of that sequence. The vLLM scheduler checks if a prefix of this sequence has already been processed and its corresponding KV cache blocks are still resident in GPU memory. If a match is found, it reuses the existing cache blocks and only performs the prefill computation for the new tokens at the end of the sequence.
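
    One practical note: depending on the vLLM release, automatic prefix caching may need to be switched on explicitly when the engine is constructed. A minimal sketch (the default varies between versions, so verify the flag against your installed release):

    python
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    # enable_prefix_caching turns on automatic prefix caching. It is off by
    # default in some vLLM releases and on by default in others, so set it
    # explicitly rather than relying on the default.
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-2-7b-chat-hf",
        max_model_len=4096,
        enable_prefix_caching=True,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)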

    This is the key insight. Our stateful service doesn't need to manipulate vLLM's internal state directly. Instead, our service's responsibility shifts to maintaining the canonical token history for each conversation. By sending the full, ever-growing token history for each turn, we transform vLLM from a stateless text generator into a stateful conversation engine.
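
    Conceptually, the per-conversation loop our service maintains is as simple as the sketch below; generate_and_get_token_ids is a stand-in for the vLLM call shown later in the full implementation, and the important part is that the entire token history is resubmitted every turn:

    python
    from typing import Callable, List

    def run_turn(
        history_token_ids: List[int],
        new_turn_token_ids: List[int],
        generate_and_get_token_ids: Callable[[List[int]], List[int]],  # placeholder for the vLLM call
    ) -> List[int]:
        """Append the new turn, resubmit the whole history, and fold in the reply."""
        prompt_token_ids = history_token_ids + new_turn_token_ids
        # vLLM's prefix cache recognizes the shared prefix, so only the
        # new_turn_token_ids portion is actually prefilled.
        generated_token_ids = generate_and_get_token_ids(prompt_token_ids)
        return prompt_token_ids + generated_token_ids  # the new canonical history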

    Let's contrast the workload:

    * Stateless Approach:

      * Turn 1 (50 tokens): 50-token prefill.
      * Turn 2 (200 total tokens): 200-token prefill.
      * Turn 3 (400 total tokens): 400-token prefill.
      * Result: TTFT scales linearly with conversation history.

    * Stateful Approach with vLLM:

      * Turn 1 (50 tokens): 50-token prefill.
      * Turn 2 (200 total tokens): KV cache hit for the first 150 tokens, 50-token prefill for the new part.
      * Turn 3 (400 total tokens): KV cache hit for the first 350 tokens, 50-token prefill for the new part.
      * Result: TTFT remains nearly constant, dominated only by the length of the new user input.

    Now, let's build the service that makes this possible.

    Designing the Stateful Service Architecture

    Our service will be built with Python, using FastAPI for the web framework and vLLM's AsyncLLMEngine for non-blocking inference. State is managed externally in Redis so the conversation history survives process restarts and the service can scale horizontally.

    Core Components:

  • vLLM Engine Wrapper: A singleton class that initializes and holds the AsyncLLMEngine.
  • State Manager: A class responsible for storing and retrieving conversation state. The primary piece of state is the list of token IDs for the conversation history.
  • FastAPI Endpoints: Two primary endpoints:
    * POST /v1/conversations: Initiates a new conversation.

    * POST /v1/conversations/{conversation_id}/continue: Adds a new turn to an existing conversation.

    The State Object

    For each conversation, we need to track more than just the tokens. A robust state object stored in Redis (e.g., as a JSON blob) would look like this:

    json
    {
      "conversation_id": "conv_abc123",
      "token_ids": [1, 50256, 12, 837, ...],
      "created_at": 1678886400,
      "last_updated_at": 1678886520,
      "model_name": "meta-llama/Llama-2-7b-chat-hf"
    }

    For our implementation, we'll focus on token_ids, but in production, metadata is invaluable for debugging, analytics, and TTL management.
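
    As a sketch of how that richer state object can be persisted (helper and variable names are illustrative; it assumes the redis.asyncio client used later in this article), the function below writes the JSON blob shown above and refreshes its TTL on every update:

    python
    import json
    import time
    from typing import List

    import redis.asyncio as redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    async def save_state(conversation_id: str, token_ids: List[int], model_name: str, ttl_s: int = 3600):
        """Persist the conversation state blob shown above and refresh its TTL."""
        key = f"conversation:{conversation_id}"
        existing = await r.get(key)
        # Preserve the original created_at if the conversation already exists.
        created_at = json.loads(existing)["created_at"] if existing else int(time.time())
        state = {
            "conversation_id": conversation_id,
            "token_ids": token_ids,
            "created_at": created_at,
            "last_updated_at": int(time.time()),
            "model_name": model_name,
        }
        await r.set(key, json.dumps(state), ex=ttl_s)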

    Complete Implementation

    Here is a complete, runnable implementation of the stateful service. Save this as server.py.

    Dependencies:

    bash
    pip install "uvicorn[standard]" fastapi pydantic redis vllm transformers

    python
    import asyncio
    import json
    import time
    import uuid
    from contextlib import asynccontextmanager
    from typing import List, Optional
    
    import redis.asyncio as redis
    from fastapi import FastAPI, HTTPException, Request
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel
    from vllm import AsyncLLMEngine, SamplingParams, AsyncEngineArgs
    from transformers import AutoTokenizer
    
    # --- Configuration ---
    MODEL_DIR = "meta-llama/Llama-2-7b-chat-hf"
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379
    
    # --- Global Objects ---
    # These will be initialized at startup
    engine: Optional[AsyncLLMEngine] = None
    tokenizer: Optional[AutoTokenizer] = None
    redis_client: Optional[redis.Redis] = None
    
    # --- Pydantic Models for API ---
    class ConversationTurn(BaseModel):
        role: str
        content: str
    
    class InitiateRequest(BaseModel):
        messages: List[ConversationTurn]
    
    class ContinueRequest(BaseModel):
        message: ConversationTurn
    
    # --- vLLM Engine and State Management ---
    
    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Startup
        global engine, tokenizer, redis_client
        print("INFO:     Initializing vLLM engine...")
        # Automatic prefix caching is what makes the stateful pattern work; on
        # some vLLM releases it is disabled by default, so enable it explicitly.
        engine_args = AsyncEngineArgs(model=MODEL_DIR, max_model_len=4096, enable_prefix_caching=True)
        engine = AsyncLLMEngine.from_engine_args(engine_args)
        
        print("INFO:     Initializing tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
        
        print("INFO:     Connecting to Redis...")
        redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, decode_responses=True)
        try:
            await redis_client.ping()
            print("INFO:     Redis connection successful.")
        except Exception as e:
            print(f"ERROR:    Could not connect to Redis: {e}")
            # In a real app, you might want to exit or have a fallback
            redis_client = None 
    
        yield
        # Shutdown
        if redis_client:
            await redis_client.close()
        print("INFO:     Application shutdown.")
    
    class ConversationState:
        def __init__(self, conversation_id: str):
            self.conversation_id = conversation_id
            self.redis_key = f"conversation:{self.conversation_id}"
    
        async def get_history_tokens(self) -> Optional[List[int]]:
            if not redis_client:
                raise HTTPException(status_code=503, detail="Redis not available")
            history = await redis_client.get(self.redis_key)
            return json.loads(history) if history else None
    
        async def save_history_tokens(self, token_ids: List[int]):
            if not redis_client:
                raise HTTPException(status_code=503, detail="Redis not available")
            await redis_client.set(self.redis_key, json.dumps(token_ids), ex=3600) # 1 hour TTL
    
    # --- FastAPI Application ---
    
    app = FastAPI(lifespan=lifespan)
    
    @app.post("/v1/conversations")
    async def initiate_conversation(request: InitiateRequest, raw_request: Request):
        """Initiates a new conversation and returns the first response stream."""
        conversation_id = f"conv_{uuid.uuid4()}"
        state = ConversationState(conversation_id)

        # Apply the chat template to the initial messages.
        # apply_chat_template expects plain dicts, not Pydantic models.
        try:
            prompt = tokenizer.apply_chat_template(
                conversation=[{"role": m.role, "content": m.content} for m in request.messages],
                tokenize=False,
                add_generation_prompt=True
            )
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"Error applying chat template: {e}")

        # The rendered template already contains special tokens (e.g., BOS),
        # so don't let encode() add them a second time.
        token_ids = tokenizer.encode(prompt, add_special_tokens=False)
        await state.save_history_tokens(token_ids)

        # The streaming logic is shared with the continue endpoint.
        return await stream_response(conversation_id, token_ids, raw_request)
    
    @app.post("/v1/conversations/{conversation_id}/continue")
    async def continue_conversation(conversation_id: str, request: ContinueRequest, raw_request: Request):
        """Continues an existing conversation and returns the next response stream."""
        state = ConversationState(conversation_id)
        
        history_tokens = await state.get_history_tokens()
        if not history_tokens:
            raise HTTPException(status_code=404, detail="Conversation not found")
    
        # Reconstruct the conversation text from the stored tokens, then render
        # only the new turn with the chat template and append it. The decode /
        # re-encode round trip is slightly inefficient, and some templates
        # re-emit a BOS or system preamble for a lone message, so verify the
        # concatenation against your model's chat template.
        history_text = tokenizer.decode(history_tokens)
        new_prompt_text = tokenizer.apply_chat_template(
            conversation=[{"role": request.message.role, "content": request.message.content}],
            tokenize=False,
            add_generation_prompt=True  # append the assistant prompt so the model knows to respond
        )

        # The new full prompt for this turn. The decoded history already contains
        # its special tokens, so don't let encode() add another BOS.
        full_prompt = history_text + new_prompt_text
        token_ids = tokenizer.encode(full_prompt, add_special_tokens=False)
    
        # --- Context Window Management ---
        # A simple sliding-window strategy. Note: this reaches into engine
        # internals, and the attribute path may differ across vLLM versions.
        max_len = engine.engine.model_config.max_model_len
        # Leave headroom for the tokens we are about to generate (max_tokens=1024
        # in stream_response), otherwise the prompt can fill the whole context.
        budget = max_len - 1024
        if len(token_ids) > budget:
            # Naive but effective: truncate from the beginning. A better approach
            # is structured truncation over whole turns (see the section below).
            token_ids = token_ids[len(token_ids) - budget:]
            # Truncating mid-turn can drop the system prompt and start on a
            # partial message; this is addressed later in the article.
    
        return await stream_response(conversation_id, token_ids, raw_request)
    
    async def stream_response(conversation_id: str, token_ids: List[int], request: Request):
        """Handles the core streaming logic with vLLM."""
        if not engine:
            raise HTTPException(status_code=503, detail="vLLM engine not initialized")
    
        request_id = f"req_{uuid.uuid4()}"
        sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)
        
        # Note: newer vLLM releases replace the prompt_token_ids kwarg with a
        # TokensPrompt(prompt_token_ids=...) passed as the first argument.
        results_generator = engine.generate(None, sampling_params, request_id, prompt_token_ids=token_ids)
    
        async def generator():
            previous_text = ""
            final_output = None
            aborted = False
            async for request_output in results_generator:
                # Check if the client has disconnected
                if await request.is_disconnected():
                    await engine.abort(request_id)
                    print(f"INFO:     Client disconnected, aborting request {request_id}")
                    aborted = True
                    break

                # Stream back only the newly generated text since the last chunk
                final_output = request_output.outputs[0]
                delta = final_output.text[len(previous_text):]
                previous_text = final_output.text
                yield delta

            # After streaming completes, persist the full conversation history.
            # Skip the update if the request was aborted or produced no output.
            if final_output is not None and not aborted:
                updated_token_ids = token_ids + list(final_output.token_ids)
                state = ConversationState(conversation_id)
                await state.save_history_tokens(updated_token_ids)
                print(f"INFO:     Conversation {conversation_id} updated with {len(final_output.token_ids)} new tokens.")
    
        return StreamingResponse(generator(), media_type="text/plain")
    
    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)
    

    Running the Service

  • Start a Redis server: docker run -d -p 6379:6379 redis
  • Run the FastAPI app: python server.py

    Interacting with the API

    You can use curl to interact with the service.

    1. Start a new conversation:

    bash
    curl -N -X POST http://localhost:8000/v1/conversations \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "What is the difference between stateful and stateless LLM inference?"}
      ]
    }'

    The response will be a stream of text. You'll need to capture the conversation_id from the server logs for the next step (in a real app, you'd return this in a header or a structured JSON stream).
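
    One way to avoid fishing the ID out of the logs is to attach it to the response itself. A minimal sketch, assuming we change the final line of stream_response (the X-Conversation-ID header name is illustrative):

    python
    # Sketch: return the conversation ID as a response header so clients can
    # read it before consuming the stream. Header name is illustrative.
    return StreamingResponse(
        generator(),
        media_type="text/plain",
        headers={"X-Conversation-ID": conversation_id},
    )

    With curl, adding -i to the first request then prints the headers, including the conversation ID, ahead of the streamed text.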

    Let's assume the server logged conv_....

    2. Continue the conversation:

    bash
    curl -N -X POST http://localhost:8000/v1/conversations/conv_.../continue \
    -H "Content-Type: application/json" \
    -d '{
      "message": {
        "role": "user", "content": "Which one is vLLM better suited for?"
      }
    }'

    The second response will be generated much faster (in terms of TTFT) because vLLM reuses the KV cache from the first turn.

    Advanced Edge Cases and Production Patterns

    This implementation is a solid foundation, but a production environment demands more resilience and sophistication.

    1. Context Window Management

    Our current implementation has a naive truncation strategy. If len(token_ids) > max_model_len, it simply slices the token list from the front. This can be problematic:

    * Broken Multi-byte Characters: A character might be represented by multiple tokens. Slicing can cut a character in half, leading to gibberish.

    * Loss of System Prompt: The most important context (the system prompt) is usually at the beginning. Truncating from the front removes it.

    A more robust strategy involves structured truncation:

    • Always preserve the first message (typically the system prompt).
    • Iterate through the conversation turns from the second message onwards.
    • Remove the oldest user/assistant turn pairs until the total token count is within the limit.

    This requires storing the conversation history not as a flat list of tokens, but as a structured list of turns, and only tokenizing the final payload sent to vLLM.
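
    Here is a minimal sketch of that structured truncation, assuming the history is kept as a list of role/content turns and re-tokenized per request (truncate_turns is an illustrative helper, not part of the server above):

    python
    from typing import Dict, List

    def truncate_turns(
        turns: List[Dict[str, str]],
        tokenizer,          # a Hugging Face tokenizer with a chat template
        max_tokens: int,
    ) -> List[Dict[str, str]]:
        """Drop the oldest non-system turn pairs until the templated prompt fits."""
        def prompt_len(t: List[Dict[str, str]]) -> int:
            return len(tokenizer.apply_chat_template(t, tokenize=True, add_generation_prompt=True))

        # Always preserve the system prompt if present.
        system = turns[:1] if turns and turns[0]["role"] == "system" else []
        rest = turns[len(system):]

        while rest and prompt_len(system + rest) > max_tokens:
            # Remove the oldest user/assistant pair (or a single turn if unpaired).
            drop = 2 if len(rest) >= 2 else 1
            rest = rest[drop:]

        return system + rest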

    2. Scaling Stateful Services: The Sticky Session Problem

    What happens when you run this service on multiple Kubernetes pods behind a standard round-robin load balancer?

    Client -> Load Balancer -> Pod A (Turn 1)

    Client -> Load Balancer -> Pod B (Turn 2)

    This still produces correct answers, because the history lives in Redis, but it destroys the performance benefit. The KV cache for the conversation exists only in the GPU memory of Pod A. When the request for Turn 2 lands on Pod B, its vLLM engine has never seen this conversation's prefix, so it performs a full, slow prefill, defeating the purpose of our architecture.

    Solution: Sticky Sessions

    You must configure your ingress controller or load balancer to use sticky sessions (also known as session affinity). All requests for a given conversation_id must be routed to the same pod.

    * Implementation: This is typically done using a cookie or a consistent hash of a header value (e.g., X-Conversation-ID); a small sketch of the hashing idea follows this list.

    * Trade-offs: Sticky sessions introduce a new failure mode. If Pod A crashes, all the active conversations on it are lost. The client will be routed to a new pod, which will trigger a slow prefill as it rebuilds the KV cache from the history stored in Redis. This is a form of graceful degradation, but the user will experience a one-time latency spike.
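
    To illustrate the consistent-hash idea, here is a minimal sketch that maps a conversation ID to one of a fixed set of backends. In practice you would rely on your ingress controller's session-affinity support rather than hand-rolled routing, and a real consistent-hash ring tolerates pod churn better than simple modulo hashing:

    python
    import hashlib

    PODS = ["pod-a:8000", "pod-b:8000", "pod-c:8000"]  # illustrative backend addresses

    def pick_pod(conversation_id: str) -> str:
        """Route every request for a given conversation to the same pod."""
        digest = hashlib.sha256(conversation_id.encode()).hexdigest()
        return PODS[int(digest, 16) % len(PODS)]

    # All turns of conv_abc123 land on the same backend:
    print(pick_pod("conv_abc123"))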

    3. State Recovery and Engine Restarts

    The vLLM engine's internal KV cache is volatile. If the Python process restarts, the cache is wiped. Our architecture is inherently resilient to this because the source of truth is the token history in Redis.

    * Recovery Flow:

    1. Service pod crashes and restarts.

    2. vLLM engine starts with an empty KV cache.

    3. A continue request arrives for an existing conversation.

    4. The service retrieves the full token history from Redis and sends it to the new vLLM engine.

    5. The engine performs a one-time full prefill, rebuilding the KV cache for that conversation.

    6. Subsequent turns for that conversation are fast again.

    This design ensures durability of the conversation history at the cost of a temporary performance hit upon failover.

    Performance Benchmarking: Theory and Practice

    Let's quantify the impact. Assume a model that can process prefill tokens at 500 tokens/sec and decode (generate) tokens at 50 tokens/sec. The user adds 50 new tokens each turn, and the model generates 150 tokens.

    | Turn | History Tokens | New Tokens | Total Prefill Tokens (Stateless) | TTFT (Stateless) | Total Prefill Tokens (Stateful) | TTFT (Stateful) |
    |------|----------------|------------|----------------------------------|------------------|---------------------------------|-----------------|
    | 1    | 0              | 50         | 50                               | 100 ms           | 50                              | 100 ms          |
    | 2    | 200            | 50         | 250                              | 500 ms           | 50                              | 100 ms          |
    | 3    | 400            | 50         | 450                              | 900 ms           | 50                              | 100 ms          |
    | 4    | 600            | 50         | 650                              | 1300 ms          | 50                              | 100 ms          |
    | 5    | 800            | 50         | 850                              | 1700 ms          | 50                              | 100 ms          |
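
    The TTFT columns follow directly from the assumed 500 tokens/sec prefill rate; a short script reproduces them:

    python
    PREFILL_TOKENS_PER_SEC = 500
    NEW_TOKENS_PER_TURN = 50
    GENERATED_TOKENS_PER_TURN = 150  # assistant reply length assumed above

    history = 0
    for turn in range(1, 6):
        stateless_prefill = history + NEW_TOKENS_PER_TURN  # re-process everything
        stateful_prefill = NEW_TOKENS_PER_TURN             # prefix cache covers the history
        print(
            f"Turn {turn}: stateless TTFT ~{1000 * stateless_prefill / PREFILL_TOKENS_PER_SEC:.0f}ms, "
            f"stateful TTFT ~{1000 * stateful_prefill / PREFILL_TOKENS_PER_SEC:.0f}ms"
        )
        history += NEW_TOKENS_PER_TURN + GENERATED_TOKENS_PER_TURN  # both sides grow the history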

    As you can see, the stateless approach quickly becomes unusable, with TTFT climbing into multiple seconds. The stateful approach maintains a consistent, low-latency user experience. This difference is the deciding factor between a demo-quality chatbot and a production-ready conversational AI system.

    Conclusion

    Building stateful LLM services is not an exotic requirement; it is a fundamental necessity for any application involving multi-turn dialogue. By moving beyond the naive stateless model and embracing the capabilities of modern inference engines like vLLM, we can build systems that are not only performant but also cost-effective.

    The key architectural takeaways are:

  • Externalize State: Your service's primary role is to manage the conversation's token history in a durable store like Redis. This history is the source of truth.
  • Leverage Prefix Caching: Trust vLLM's internal KV cache management. By sending the full token history on each turn, you enable its prefix caching to eliminate redundant computations.
  • Plan for Statefulness in Scaling: Standard round-robin load balancing is incompatible with this pattern. Architect for sticky sessions from day one.
  • Handle Edge Cases Gracefully: Implement robust context window management and understand the performance implications of failover scenarios.

    By implementing these advanced patterns, you can build conversational AI that scales to thousands of concurrent users while maintaining the low latency and responsiveness required for a compelling user experience.
