Stateful Inference with vLLM for Conversational AI at Scale

18 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Unseen Scalability Crisis in Conversational AI

In the world of large language models, generating a single, isolated response is a solved problem. However, building a responsive, scalable conversational agent that remembers context across multiple turns introduces a significant performance bottleneck that is often overlooked in initial development. The standard approach—prepending the entire chat history to each new user query—is computationally catastrophic at scale.

Consider a typical interaction:

  • User: What is vLLM?
  • Model: vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
  • User: How does it achieve that?

    To process the third turn (the second user message), a stateless service constructs a prompt like this:

    text
    <|user|>
    What is vLLM?
    <|assistant|>
    vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
    <|user|>
    How does it achieve that?
    <|assistant|>

    The entire concatenated string is tokenized and fed to the model. The model's attention mechanism must then re-process every single token from the beginning of the conversation. This initial processing phase, known as prefill, is where the Key-Value (KV) cache for the attention layers is computed. As the conversation grows, the prefill phase dominates the generation time, leading to a linear increase in Time-to-First-Token (TTFT). For a conversation with 10 turns and an average of 150 tokens per turn, you are re-processing thousands of tokens just to generate the next response. This is prohibitively slow and expensive.
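
    To make the arithmetic concrete, here is a quick back-of-the-envelope sketch (plain Python, no vLLM required) of how many tokens a stateless service re-prefills over that 10-turn, ~150-tokens-per-turn conversation:

    python
    # Rough cost model for stateless serving: each turn re-prefills the whole history.
    TOKENS_PER_TURN = 150  # average user message + assistant reply, as in the example above
    TURNS = 10

    history = 0
    total_stateless_prefill = 0
    for turn in range(1, TURNS + 1):
        history += TOKENS_PER_TURN              # the history grows every turn
        total_stateless_prefill += history      # stateless: the entire history is prefilled again
        print(f"Turn {turn}: prefill {history} tokens (running total {total_stateless_prefill})")

    # With prefix caching, only the ~150 new tokens per turn need prefill.
    print(f"Stateless total: {total_stateless_prefill} vs. stateful total: {TURNS * TOKENS_PER_TURN}")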

    This article presents a production-grade architecture for a stateful inference service using vLLM. We will leverage vLLM's highly efficient KV cache management, specifically its prefix caching, to eliminate redundant computation and achieve near-constant TTFT, regardless of the conversation's length. We will not be explaining the basics of vLLM; we assume you understand its purpose. Instead, we'll focus on the architectural patterns, implementation details, and critical edge cases you'll face in a real-world deployment.

    The Architectural Linchpin: vLLM's Prefix Caching

    While vLLM is renowned for PagedAttention, which solves KV cache memory fragmentation, the feature that directly enables our stateful architecture is its automatic prefix caching. When you send a sequence of token IDs to vLLM, it computes a hash of that sequence. The vLLM scheduler checks if a prefix of this sequence has already been processed and its corresponding KV cache blocks are still resident in GPU memory. If a match is found, it reuses the existing cache blocks and only performs the prefill computation for the new tokens at the end of the sequence.
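
    One practical note: depending on the vLLM release, automatic prefix caching may need to be switched on explicitly when the engine is constructed. A minimal sketch (the default varies between versions, so verify the flag against your installed release):

    python
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    # enable_prefix_caching turns on automatic prefix caching. It is off by
    # default in some vLLM releases and on by default in others, so set it
    # explicitly rather than relying on the default.
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-2-7b-chat-hf",
        max_model_len=4096,
        enable_prefix_caching=True,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)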

    This is the key insight. Our stateful service doesn't need to manipulate vLLM's internal state directly. Instead, our service's responsibility shifts to maintaining the canonical token history for each conversation. By sending the full, ever-growing token history for each turn, we transform vLLM from a stateless text generator into a stateful conversation engine.
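
    Conceptually, the per-conversation loop our service maintains is as simple as the sketch below; generate_and_get_token_ids is a stand-in for the vLLM call shown later in the full implementation, and the important part is that the entire token history is resubmitted every turn:

    python
    from typing import Callable, List

    def run_turn(
        history_token_ids: List[int],
        new_turn_token_ids: List[int],
        generate_and_get_token_ids: Callable[[List[int]], List[int]],  # placeholder for the vLLM call
    ) -> List[int]:
        """Append the new turn, resubmit the whole history, and fold in the reply."""
        prompt_token_ids = history_token_ids + new_turn_token_ids
        # vLLM's prefix cache recognizes the shared prefix, so only the
        # new_turn_token_ids portion is actually prefilled.
        generated_token_ids = generate_and_get_token_ids(prompt_token_ids)
        return prompt_token_ids + generated_token_ids  # the new canonical history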

    Let's contrast the workload:

    * Stateless Approach:

      * Turn 1 (50 tokens): 50-token prefill.
      * Turn 2 (200 total tokens): 200-token prefill.
      * Turn 3 (400 total tokens): 400-token prefill.
      * Result: TTFT scales linearly with conversation history.

    * Stateful Approach with vLLM:

      * Turn 1 (50 tokens): 50-token prefill.
      * Turn 2 (200 total tokens): KV cache hit for the first 150 tokens, 50-token prefill for the new part.
      * Turn 3 (400 total tokens): KV cache hit for the first 350 tokens, 50-token prefill for the new part.
      * Result: TTFT remains nearly constant, dominated only by the length of the new user input.

    Now, let's build the service that makes this possible.

    Designing the Stateful Service Architecture

    Our service will be built with Python, using FastAPI for the web framework and vLLM's AsyncLLMEngine for non-blocking inference. State is managed externally in Redis so the conversation history survives process restarts and the service can scale horizontally.

    Core Components:

  • vLLM Engine Wrapper: A singleton class that initializes and holds the AsyncLLMEngine.
  • State Manager: A class responsible for storing and retrieving conversation state. The primary piece of state is the list of token IDs for the conversation history.
  • FastAPI Endpoints: Two primary endpoints:
    * POST /v1/conversations: Initiates a new conversation.

    * POST /v1/conversations/{conversation_id}/continue: Adds a new turn to an existing conversation.

    The State Object

    For each conversation, we need to track more than just the tokens. A robust state object stored in Redis (e.g., as a JSON blob) would look like this:

    json
    {
      "conversation_id": "conv_abc123",
      "token_ids": [1, 50256, 12, 837, ...],
      "created_at": 1678886400,
      "last_updated_at": 1678886520,
      "model_name": "meta-llama/Llama-2-7b-chat-hf"
    }

    For our implementation, we'll focus on token_ids, but in production, metadata is invaluable for debugging, analytics, and TTL management.
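
    As a sketch of how that richer state object can be persisted (helper and variable names are illustrative; it assumes the redis.asyncio client used later in this article), the function below writes the JSON blob shown above and refreshes its TTL on every update:

    python
    import json
    import time
    from typing import List

    import redis.asyncio as redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    async def save_state(conversation_id: str, token_ids: List[int], model_name: str, ttl_s: int = 3600):
        """Persist the conversation state blob shown above and refresh its TTL."""
        key = f"conversation:{conversation_id}"
        existing = await r.get(key)
        # Preserve the original created_at if the conversation already exists.
        created_at = json.loads(existing)["created_at"] if existing else int(time.time())
        state = {
            "conversation_id": conversation_id,
            "token_ids": token_ids,
            "created_at": created_at,
            "last_updated_at": int(time.time()),
            "model_name": model_name,
        }
        await r.set(key, json.dumps(state), ex=ttl_s)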

    Complete Implementation

    Here is a complete, runnable implementation of the stateful service. Save this as server.py.

    Dependencies:

    bash
    pip install "uvicorn[standard]" fastapi pydantic redis vllm transformers

    python
    import asyncio
    import json
    import time
    import uuid
    from contextlib import asynccontextmanager
    from typing import List, Optional
    
    import redis.asyncio as redis
    from fastapi import FastAPI, HTTPException, Request
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel
    from vllm import AsyncLLMEngine, SamplingParams, AsyncEngineArgs
    from transformers import AutoTokenizer
    
    # --- Configuration ---
    MODEL_DIR = "meta-llama/Llama-2-7b-chat-hf"
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379
    
    # --- Global Objects ---
    # These will be initialized at startup
    engine: Optional[AsyncLLMEngine] = None
    tokenizer: Optional[AutoTokenizer] = None
    redis_client: Optional[redis.Redis] = None
    
    # --- Pydantic Models for API ---
    class ConversationTurn(BaseModel):
        role: str
        content: str
    
    class InitiateRequest(BaseModel):
        messages: List[ConversationTurn]
    
    class ContinueRequest(BaseModel):
        message: ConversationTurn
    
    # --- vLLM Engine and State Management ---
    
    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Startup
        global engine, tokenizer, redis_client
        print("INFO:     Initializing vLLM engine...")
        # Automatic prefix caching is what makes the stateful pattern work; on
        # some vLLM releases it is disabled by default, so enable it explicitly.
        engine_args = AsyncEngineArgs(model=MODEL_DIR, max_model_len=4096, enable_prefix_caching=True)
        engine = AsyncLLMEngine.from_engine_args(engine_args)
        
        print("INFO:     Initializing tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
        
        print("INFO:     Connecting to Redis...")
        redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, decode_responses=True)
        try:
            await redis_client.ping()
            print("INFO:     Redis connection successful.")
        except Exception as e:
            print(f"ERROR:    Could not connect to Redis: {e}")
            # In a real app, you might want to exit or have a fallback
            redis_client = None 
    
        yield
        # Shutdown
        if redis_client:
            await redis_client.close()
        print("INFO:     Application shutdown.")
    
    class ConversationState:
        def __init__(self, conversation_id: str):
            self.conversation_id = conversation_id
            self.redis_key = f"conversation:{self.conversation_id}"
    
        async def get_history_tokens(self) -> Optional[List[int]]:
            if not redis_client:
                raise HTTPException(status_code=503, detail="Redis not available")
            history = await redis_client.get(self.redis_key)
            return json.loads(history) if history else None
    
        async def save_history_tokens(self, token_ids: List[int]):
            if not redis_client:
                raise HTTPException(status_code=503, detail="Redis not available")
            await redis_client.set(self.redis_key, json.dumps(token_ids), ex=3600) # 1 hour TTL
    
    # --- FastAPI Application ---
    
    app = FastAPI(lifespan=lifespan)
    
    @app.post("/v1/conversations")
    async def initiate_conversation(request: InitiateRequest, raw_request: Request):
        """Initiates a new conversation and returns the first response stream."""
        conversation_id = f"conv_{uuid.uuid4()}"
        state = ConversationState(conversation_id)

        # Apply the chat template to the initial messages.
        # apply_chat_template expects plain dicts, not Pydantic models.
        try:
            prompt = tokenizer.apply_chat_template(
                conversation=[{"role": m.role, "content": m.content} for m in request.messages],
                tokenize=False,
                add_generation_prompt=True
            )
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"Error applying chat template: {e}")

        # The rendered template already contains special tokens (e.g., BOS),
        # so don't let encode() add them a second time.
        token_ids = tokenizer.encode(prompt, add_special_tokens=False)
        await state.save_history_tokens(token_ids)

        # The streaming logic is shared with the continue endpoint.
        return await stream_response(conversation_id, token_ids, raw_request)
    
    @app.post("/v1/conversations/{conversation_id}/continue")
    async def continue_conversation(conversation_id: str, request: ContinueRequest, raw_request: Request):
        """Continues an existing conversation and returns the next response stream."""
        state = ConversationState(conversation_id)
        
        history_tokens = await state.get_history_tokens()
        if not history_tokens:
            raise HTTPException(status_code=404, detail="Conversation not found")
    
        # Reconstruct the conversation text from the stored tokens, then render
        # only the new turn with the chat template and append it. The decode /
        # re-encode round trip is slightly inefficient, and some templates
        # re-emit a BOS or system preamble for a lone message, so verify the
        # concatenation against your model's chat template.
        history_text = tokenizer.decode(history_tokens)
        new_prompt_text = tokenizer.apply_chat_template(
            conversation=[{"role": request.message.role, "content": request.message.content}],
            tokenize=False,
            add_generation_prompt=True  # append the assistant prompt so the model knows to respond
        )

        # The new full prompt for this turn. The decoded history already contains
        # its special tokens, so don't let encode() add another BOS.
        full_prompt = history_text + new_prompt_text
        token_ids = tokenizer.encode(full_prompt, add_special_tokens=False)
    
        # --- Context Window Management ---
        # A simple sliding-window strategy. Note: this reaches into engine
        # internals, and the attribute path may differ across vLLM versions.
        max_len = engine.engine.model_config.max_model_len
        # Leave headroom for the tokens we are about to generate (max_tokens=1024
        # in stream_response), otherwise the prompt can fill the whole context.
        budget = max_len - 1024
        if len(token_ids) > budget:
            # Naive but effective: truncate from the beginning. A better approach
            # is structured truncation over whole turns (see the section below).
            token_ids = token_ids[len(token_ids) - budget:]
            # Truncating mid-turn can drop the system prompt and start on a
            # partial message; this is addressed later in the article.
    
        return await stream_response(conversation_id, token_ids, raw_request)
    
    async def stream_response(conversation_id: str, token_ids: List[int], request: Request):
        """Handles the core streaming logic with vLLM."""
        if not engine:
            raise HTTPException(status_code=503, detail="vLLM engine not initialized")
    
        request_id = f"req_{uuid.uuid4()}"
        sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)
        
        # Note: newer vLLM releases replace the prompt_token_ids kwarg with a
        # TokensPrompt(prompt_token_ids=...) passed as the first argument.
        results_generator = engine.generate(None, sampling_params, request_id, prompt_token_ids=token_ids)
    
        async def generator():
            previous_text = ""
            final_output = None
            aborted = False
            async for request_output in results_generator:
                # Check if the client has disconnected
                if await request.is_disconnected():
                    await engine.abort(request_id)
                    print(f"INFO:     Client disconnected, aborting request {request_id}")
                    aborted = True
                    break

                # Stream back only the newly generated text since the last chunk
                final_output = request_output.outputs[0]
                delta = final_output.text[len(previous_text):]
                previous_text = final_output.text
                yield delta

            # After streaming completes, persist the full conversation history.
            # Skip the update if the request was aborted or produced no output.
            if final_output is not None and not aborted:
                updated_token_ids = token_ids + list(final_output.token_ids)
                state = ConversationState(conversation_id)
                await state.save_history_tokens(updated_token_ids)
                print(f"INFO:     Conversation {conversation_id} updated with {len(final_output.token_ids)} new tokens.")
    
        return StreamingResponse(generator(), media_type="text/plain")
    
    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)
    

    Running the Service

  • Start a Redis server: docker run -d -p 6379:6379 redis
  • Run the FastAPI app: python server.py

    Interacting with the API

    You can use curl to interact with the service.

    1. Start a new conversation:

    bash
    curl -N -X POST http://localhost:8000/v1/conversations \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "What is the difference between stateful and stateless LLM inference?"}
      ]
    }'

    The response will be a stream of text. You'll need to capture the conversation_id from the server logs for the next step (in a real app, you'd return this in a header or a structured JSON stream).
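
    One way to avoid fishing the ID out of the logs is to attach it to the response itself. A minimal sketch, assuming we change the final line of stream_response (the X-Conversation-ID header name is illustrative):

    python
    # Sketch: return the conversation ID as a response header so clients can
    # read it before consuming the stream. Header name is illustrative.
    return StreamingResponse(
        generator(),
        media_type="text/plain",
        headers={"X-Conversation-ID": conversation_id},
    )

    With curl, adding -i to the first request then prints the headers, including the conversation ID, ahead of the streamed text.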

    Let's assume the server logged conv_....

    2. Continue the conversation:

    bash
    curl -N -X POST http://localhost:8000/v1/conversations/conv_.../continue \
    -H "Content-Type: application/json" \
    -d '{
      "message": {
        "role": "user", "content": "Which one is vLLM better suited for?"
      }
    }'

    The second response will be generated much faster (in terms of TTFT) because vLLM reuses the KV cache from the first turn.

    Advanced Edge Cases and Production Patterns

    This implementation is a solid foundation, but a production environment demands more resilience and sophistication.

    1. Context Window Management

    Our current implementation has a naive truncation strategy. If len(token_ids) > max_model_len, it simply slices the token list from the front. This can be problematic:

    * Broken Multi-byte Characters: A character might be represented by multiple tokens. Slicing can cut a character in half, leading to gibberish.

    * Loss of System Prompt: The most important context (the system prompt) is usually at the beginning. Truncating from the front removes it.

    A more robust strategy involves structured truncation:

    • Always preserve the first message (typically the system prompt).
    • Iterate through the conversation turns from the second message onwards.
    • Remove the oldest user/assistant turn pairs until the total token count is within the limit.

    This requires storing the conversation history not as a flat list of tokens, but as a structured list of turns, and only tokenizing the final payload sent to vLLM.
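
    Here is a minimal sketch of that structured truncation, assuming the history is kept as a list of role/content turns and re-tokenized per request (truncate_turns is an illustrative helper, not part of the server above):

    python
    from typing import Dict, List

    def truncate_turns(
        turns: List[Dict[str, str]],
        tokenizer,          # a Hugging Face tokenizer with a chat template
        max_tokens: int,
    ) -> List[Dict[str, str]]:
        """Drop the oldest non-system turn pairs until the templated prompt fits."""
        def prompt_len(t: List[Dict[str, str]]) -> int:
            return len(tokenizer.apply_chat_template(t, tokenize=True, add_generation_prompt=True))

        # Always preserve the system prompt if present.
        system = turns[:1] if turns and turns[0]["role"] == "system" else []
        rest = turns[len(system):]

        while rest and prompt_len(system + rest) > max_tokens:
            # Remove the oldest user/assistant pair (or a single turn if unpaired).
            drop = 2 if len(rest) >= 2 else 1
            rest = rest[drop:]

        return system + rest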

    2. Scaling Stateful Services: The Sticky Session Problem

    What happens when you run this service on multiple Kubernetes pods behind a standard round-robin load balancer?

    Client -> Load Balancer -> Pod A (Turn 1)

    Client -> Load Balancer -> Pod B (Turn 2)

    This still produces correct answers, because the history lives in Redis, but it destroys the performance benefit. The KV cache for the conversation exists only in the GPU memory of Pod A. When the request for Turn 2 lands on Pod B, its vLLM engine has never seen this conversation's prefix, so it performs a full, slow prefill, defeating the purpose of our architecture.

    Solution: Sticky Sessions

    You must configure your ingress controller or load balancer to use sticky sessions (also known as session affinity). All requests for a given conversation_id must be routed to the same pod.

    * Implementation: This is typically done using a cookie or a consistent hash of a header value (e.g., X-Conversation-ID); a small sketch of the hashing idea follows this list.

    * Trade-offs: Sticky sessions introduce a new failure mode. If Pod A crashes, all the active conversations on it are lost. The client will be routed to a new pod, which will trigger a slow prefill as it rebuilds the KV cache from the history stored in Redis. This is a form of graceful degradation, but the user will experience a one-time latency spike.
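
    To illustrate the consistent-hash idea, here is a minimal sketch that maps a conversation ID to one of a fixed set of backends. In practice you would rely on your ingress controller's session-affinity support rather than hand-rolled routing, and a real consistent-hash ring tolerates pod churn better than simple modulo hashing:

    python
    import hashlib

    PODS = ["pod-a:8000", "pod-b:8000", "pod-c:8000"]  # illustrative backend addresses

    def pick_pod(conversation_id: str) -> str:
        """Route every request for a given conversation to the same pod."""
        digest = hashlib.sha256(conversation_id.encode()).hexdigest()
        return PODS[int(digest, 16) % len(PODS)]

    # All turns of conv_abc123 land on the same backend:
    print(pick_pod("conv_abc123"))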

    3. State Recovery and Engine Restarts

    The vLLM engine's internal KV cache is volatile. If the Python process restarts, the cache is wiped. Our architecture is inherently resilient to this because the source of truth is the token history in Redis.

    * Recovery Flow:

    1. Service pod crashes and restarts.

    2. vLLM engine starts with an empty KV cache.

    3. A continue request arrives for an existing conversation.

    4. The service retrieves the full token history from Redis and sends it to the new vLLM engine.

    5. The engine performs a one-time full prefill, rebuilding the KV cache for that conversation.

    6. Subsequent turns for that conversation are fast again.

    This design ensures durability of the conversation history at the cost of a temporary performance hit upon failover.

    Performance Benchmarking: Theory and Practice

    Let's quantify the impact. Assume a model that can process prefill tokens at 500 tokens/sec and decode (generate) tokens at 50 tokens/sec. The user adds 50 new tokens each turn, and the model generates 150 tokens.

    | Turn | History Tokens | New Tokens | Total Prefill Tokens (Stateless) | TTFT (Stateless) | Total Prefill Tokens (Stateful) | TTFT (Stateful) |
    |------|----------------|------------|----------------------------------|------------------|---------------------------------|-----------------|
    | 1    | 0              | 50         | 50                               | 100 ms           | 50                              | 100 ms          |
    | 2    | 200            | 50         | 250                              | 500 ms           | 50                              | 100 ms          |
    | 3    | 400            | 50         | 450                              | 900 ms           | 50                              | 100 ms          |
    | 4    | 600            | 50         | 650                              | 1300 ms          | 50                              | 100 ms          |
    | 5    | 800            | 50         | 850                              | 1700 ms          | 50                              | 100 ms          |
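
    The TTFT columns follow directly from the assumed 500 tokens/sec prefill rate; a short script reproduces them:

    python
    PREFILL_TOKENS_PER_SEC = 500
    NEW_TOKENS_PER_TURN = 50
    GENERATED_TOKENS_PER_TURN = 150  # assistant reply length assumed above

    history = 0
    for turn in range(1, 6):
        stateless_prefill = history + NEW_TOKENS_PER_TURN  # re-process everything
        stateful_prefill = NEW_TOKENS_PER_TURN             # prefix cache covers the history
        print(
            f"Turn {turn}: stateless TTFT ~{1000 * stateless_prefill / PREFILL_TOKENS_PER_SEC:.0f}ms, "
            f"stateful TTFT ~{1000 * stateful_prefill / PREFILL_TOKENS_PER_SEC:.0f}ms"
        )
        history += NEW_TOKENS_PER_TURN + GENERATED_TOKENS_PER_TURN  # both sides grow the history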

    As you can see, the stateless approach quickly becomes unusable, with TTFT climbing into multiple seconds. The stateful approach maintains a consistent, low-latency user experience. This difference is the deciding factor between a demo-quality chatbot and a production-ready conversational AI system.

    Conclusion

    Building stateful LLM services is not an exotic requirement; it is a fundamental necessity for any application involving multi-turn dialogue. By moving beyond the naive stateless model and embracing the capabilities of modern inference engines like vLLM, we can build systems that are not only performant but also cost-effective.

    The key architectural takeaways are:

  • Externalize State: Your service's primary role is to manage the conversation's token history in a durable store like Redis. This history is the source of truth.
  • Leverage Prefix Caching: Trust vLLM's internal KV cache management. By sending the full token history on each turn, you enable its prefix caching to eliminate redundant computations.
  • Plan for Statefulness in Scaling: Standard round-robin load balancing is incompatible with this pattern. Architect for sticky sessions from day one.
  • Handle Edge Cases Gracefully: Implement robust context window management and understand the performance implications of failover scenarios.

    By implementing these advanced patterns, you can build conversational AI that scales to thousands of concurrent users while maintaining the low latency and responsiveness required for a compelling user experience.
