Maximizing LLM Throughput with vLLM's PagedAttention Mechanism

Goh Ling Yong

The Bottleneck in Autoregressive LLM Serving: The KV Cache Problem

For engineers deploying Large Language Models (LLMs) in production, the initial excitement of model.generate() quickly gives way to the harsh realities of performance engineering. The primary culprit for poor throughput and high latency under concurrent loads isn't the model's forward pass computation itself, but rather the inefficient management of its state—specifically, the Key-Value (KV) cache.

In autoregressive models like GPT, Llama, or Mistral, each generated token depends on all preceding tokens. To avoid re-computing the entire sequence for every new token, we cache the key and value tensors from the attention layers for all previous tokens. This is the KV cache. While it dramatically speeds up individual generation, it introduces a severe memory management challenge on the GPU.

Why Naive KV Cache Management Fails at Scale

  • Dynamic and Unpredictable Size: The size of the KV cache grows linearly with the sequence length. For a batch of requests, each with a different prompt length and a different number of generated tokens, the memory footprint is highly dynamic and unpredictable. A request generating 50 tokens will have a small cache, while one generating 2000 tokens will have a massive one.
  • GPU Memory Fragmentation: Traditional approaches pre-allocate a contiguous block of GPU memory for each request's maximum possible KV cache size (max_sequence_length). If a request only generates 100 tokens but the max_sequence_length is 2048, over 95% of that allocated memory is wasted; this is internal fragmentation. As requests of different lengths come and go, the gaps left between allocations (external fragmentation) turn GPU memory into a checkerboard of used and reserved-but-idle regions, preventing new requests from being scheduled even when sufficient total memory is free. It's the GPU equivalent of malloc hell.
  • Inefficient Batching: The standard approach is static batching. A group of requests is padded to the length of the longest sequence in the batch and processed together. The entire batch must wait for the last request to finish generating its full sequence before the next batch can start. This leads to GPUs sitting idle, waiting for the slowest member of the party. This is a direct cause of low GPU utilization and, consequently, low throughput.
    Let's visualize the memory fragmentation problem. Imagine a GPU with 16 GB of memory and a configured max_sequence_length of 2048, so every request pre-allocates enough space for the full 2048 tokens. A back-of-the-envelope estimate of the resulting waste follows the diagram below.

    text
    GPU Memory (Naive Allocation):
    
    |-- Request 1 (uses 10%, 90% wasted) --|-- Request 2 (uses 50%, 50% wasted) --|-- Request 3 (uses 5%, 95% wasted) --| ... | Free Memory (but too small for a new request) |
    

    This is the precise problem that vLLM's PagedAttention was designed to solve.
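
    To put rough numbers on the waste, here is an illustrative back-of-the-envelope sketch for a Llama-2-7B-class model (32 layers, 32 KV heads, head dimension 128, fp16 cache). The helper name kv_cache_bytes_per_token is hypothetical, and exact figures vary by model and dtype.

    python
    # Back-of-the-envelope KV cache sizing (illustrative figures only).
    def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
        # 2x accounts for the separate key and value tensors at every layer
        return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    per_token = kv_cache_bytes_per_token()   # 524,288 bytes, about 0.5 MiB per token
    reserved = per_token * 2048              # pre-allocated for max_sequence_length
    used = per_token * 100                   # a request that only needs 100 tokens
    print(f"Reserved: {reserved / 2**30:.2f} GiB, "
          f"used: {used / 2**30:.2f} GiB, "
          f"wasted: {1 - used / reserved:.0%}")
    # Reserved: 1.00 GiB, used: 0.05 GiB, wasted: 95%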

    PagedAttention: Virtual Memory for the GPU

    PagedAttention borrows a decades-old, battle-tested concept from operating systems: virtual memory and paging. Instead of allocating a single, large, contiguous block of memory for each sequence's KV cache, it allocates memory in smaller, fixed-size chunks called blocks.

  • Physical Blocks: The entire KV cache for all concurrent requests is stored in a pool of these physical blocks on the GPU. Each block can store the key-value pairs for a fixed number of tokens (e.g., 16 tokens).
  • Logical Blocks & Block Tables: Each sequence is assigned a block table, which is a mapping from its logical token positions to physical block locations. This is analogous to a page table in an OS. A sequence's KV cache can now be stored in non-contiguous physical blocks.
  • Gather at Attention Time: When the model needs to attend to previous tokens, the GPU kernel uses this block table to efficiently gather the required key and value vectors from their disparate locations in physical memory, as the small sketch below illustrates.
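
    As a rough illustration (not vLLM's actual kernel code), the logical-to-physical translation is just an indexed lookup, much like an OS page-table walk. The helper name locate_kv and the example block table below are hypothetical.

    python
    # Hypothetical sketch of address translation in a paged KV cache.
    # block_table maps logical block index -> physical block id on the GPU.
    BLOCK_SIZE = 16

    def locate_kv(block_table, token_idx):
        """Return (physical_block_id, offset) for a logical token position."""
        return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    # A 40-token sequence stored in three non-contiguous physical blocks:
    block_table = [7, 1, 12]
    print(locate_kv(block_table, 0))    # (7, 0)  -> first token
    print(locate_kv(block_table, 20))   # (1, 4)  -> token 20 lives in physical block 1
    print(locate_kv(block_table, 39))   # (12, 7) -> last token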

    The Architectural Advantage

  • Elimination of Internal Fragmentation: Memory is allocated on-demand in small, block-sized increments as a sequence grows. If a sequence needs one more token's worth of KV cache space and its last block is full, the scheduler simply allocates a new block from the global pool and updates the sequence's block table. The waste is now confined to at most the last block of each sequence, reducing memory waste from ~50-90% in naive systems to less than 4%.
  • Efficient Memory Sharing: Complex generation strategies like parallel sampling or beam search generate multiple candidate sequences from a single prompt. With traditional methods, this requires copying the entire KV cache for each candidate. With PagedAttention, you simply copy the block table and map the new sequences to the same underlying physical blocks for the shared prompt portion. This is a copy-on-write mechanism, drastically reducing the memory overhead of such sampling methods.

    python
    # Conceptual representation of PagedAttention
    
    class BlockTableManager:
        def __init__(self, total_blocks, block_size):
            self.free_blocks = list(range(total_blocks))
            self.block_size = block_size
    
        def allocate_block(self):
            if not self.free_blocks:
                raise MemoryError("Out of GPU blocks")
            return self.free_blocks.pop(0)
    
        def free_block(self, block_id):
            self.free_blocks.append(block_id)
    
    class Sequence:
        def __init__(self, prompt_tokens, manager):
            self.tokens = list(prompt_tokens)
            self.block_table = []
            self.manager = manager
            self._allocate_for_new_tokens(len(self.tokens))
    
        def _allocate_for_new_tokens(self, num_tokens):
            # Allocate enough blocks to cover num_tokens tokens in total
            # (ceiling division), minus the blocks this sequence already holds.
            required_blocks = -(-num_tokens // self.manager.block_size) - len(self.block_table)
            for _ in range(required_blocks):
                self.block_table.append(self.manager.allocate_block())
    
        def append_token(self, token_id):
            if len(self.tokens) % self.manager.block_size == 0:
                self.block_table.append(self.manager.allocate_block())
            self.tokens.append(token_id)
    
    # With PagedAttention, forking is cheap: real vLLM also reference-counts the
    # shared blocks and only copies a block when the sequences diverge (copy-on-write)
    def fork_sequence(parent_sequence: Sequence) -> Sequence:
        child = Sequence([], parent_sequence.manager)
        child.tokens = parent_sequence.tokens[:]
        # Just copy the list of block IDs, not the massive tensors inside
        child.block_table = parent_sequence.block_table[:]
        return child
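
    # --- Hypothetical usage of the sketch above (not the vLLM API) ---
    manager = BlockTableManager(total_blocks=8, block_size=16)
    seq = Sequence(list(range(40)), manager)    # 40-token prompt -> 3 KV blocks
    fork = fork_sequence(seq)                   # shares the same 3 physical blocks
    assert fork.block_table == seq.block_table  # no KV tensors were copied
    print(len(manager.free_blocks))             # 5 blocks still free in the pool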

    This architectural shift enables the second key innovation in vLLM: continuous batching.

    From Static to Continuous Batching

    With memory management solved, vLLM can implement a much more dynamic and efficient scheduling algorithm.

  • Static Batching (The Old Way): Wait for N requests. Pad them all. Run the batch. Wait for all of them to finish. Release resources. Repeat.
  • Continuous Batching (The vLLM Way): The vLLM engine maintains a running batch of sequences. In each iteration of the generation loop:

    1. The scheduler checks for newly arrived requests in a waiting queue.

    2. If there is enough free block memory, it adds new requests to the running batch.

    3. It runs a single forward pass for the entire batch, generating one new token for each sequence.

    4. It checks whether any sequences in the batch have finished (reached an EOS token or max_tokens).

    5. Finished sequences are removed from the batch, and their memory blocks are immediately freed and returned to the pool.

    6. The process repeats, potentially adding new requests in the very next iteration.

    This means the GPU is constantly processing tokens, never waiting for a whole batch to complete. A short request can start and finish while a long-running request is still in the middle of its generation, maximizing GPU utilization and dramatically increasing throughput.
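
    To make the loop concrete, here is a deliberately simplified, hypothetical sketch of a continuous-batching scheduler. It is not vLLM's actual scheduler (preemption, swapping, and output streaming are omitted), and the names Request, blocks_for, and serve are illustrative only.

    python
    from collections import deque

    BLOCK_SIZE = 16      # tokens per KV block (assumed)
    TOTAL_BLOCKS = 1024  # size of the global block pool (assumed)

    class Request:
        def __init__(self, req_id, prompt_len, max_new_tokens):
            self.req_id = req_id
            self.num_tokens = prompt_len      # prompt + tokens generated so far
            self.remaining = max_new_tokens   # generation budget left
            self.num_blocks = 0               # KV blocks currently held

    def blocks_for(num_tokens):
        return -(-num_tokens // BLOCK_SIZE)   # ceiling division

    def serve(waiting: deque):
        free_blocks, running = TOTAL_BLOCKS, []
        while waiting or running:
            # Steps 1-2: admit waiting requests while KV block memory is available.
            while waiting and blocks_for(waiting[0].num_tokens + 1) <= free_blocks:
                req = waiting.popleft()
                req.num_blocks = blocks_for(req.num_tokens + 1)
                free_blocks -= req.num_blocks
                running.append(req)

            # Step 3: one forward pass yields one new token for every running sequence.
            for req in running:
                req.num_tokens += 1
                req.remaining -= 1
                needed = blocks_for(req.num_tokens + 1)
                free_blocks -= needed - req.num_blocks    # grow the cache on demand
                req.num_blocks = needed

            # Steps 4-5: retire finished sequences and reclaim their blocks immediately.
            for req in [r for r in running if r.remaining == 0]:
                running.remove(req)
                free_blocks += req.num_blocks
            # Step 6: loop again; new arrivals can join on the very next iteration.

    serve(deque([Request("a", prompt_len=10, max_new_tokens=5),
                 Request("b", prompt_len=200, max_new_tokens=50)]))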

    Production Implementation: High-Throughput Async Server

    Let's build a production-grade inference server using vLLM's asynchronous engine with FastAPI. This pattern is ideal for handling high-concurrency workloads.

    Step 1: Dockerizing the Environment

    First, we need a Dockerfile that sets up the CUDA environment and installs PyTorch and vLLM. Using an official NVIDIA CUDA base image is crucial.

    dockerfile
    # Use an official NVIDIA CUDA base image
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    # Install Python and other essentials
    RUN apt-get update && apt-get install -y \
        python3.10 python3.10-venv python3-pip git \
        && rm -rf /var/lib/apt/lists/*
    
    # Set up a virtual environment
    RUN python3.10 -m venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    
    # Install PyTorch and vLLM
    # Ensure torch version matches CUDA version for optimal performance
    RUN pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
    RUN pip install vllm==0.3.2 fastapi uvicorn sse-starlette pydantic
    
    WORKDIR /app
    
    COPY ./src /app
    
    # Expose the port our API will run on
    EXPOSE 8000
    
    # Command to run the service.
    # Run a single uvicorn worker: the vLLM engine batches requests internally,
    # and multiple workers would each try to load the model onto the GPU.
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
    

    Step 2: The Asynchronous FastAPI Server

    We'll use vllm.AsyncLLMEngine and vllm.AsyncEngineArgs to configure and run the model. The engine runs its scheduling and generation loop in the background and exposes an asynchronous streaming interface, so the API server never blocks on long generation tasks.

    src/main.py:

    python
    import json

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse, StreamingResponse
    from pydantic import BaseModel
    
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.sampling_params import SamplingParams
    from vllm.utils import random_uuid
    
    # --- Pydantic Models for API --- #
    class GenerationRequest(BaseModel):
        prompt: str
        max_tokens: int = 256
        temperature: float = 0.7
        top_p: float = 0.95
        stream: bool = False
    
    # --- Engine Configuration --- #
    MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
    
    # These settings are critical for performance.
    # Adjust them based on your GPU memory and expected workload.
    engine_args = AsyncEngineArgs(
        model=MODEL_ID,
        tokenizer=MODEL_ID,
        tensor_parallel_size=1,  # Set to number of GPUs for multi-GPU inference
        dtype="auto",
        seed=0,
        gpu_memory_utilization=0.90, # Use 90% of GPU memory
        max_num_batched_tokens=4096, # Fine-tune based on VRAM
        enforce_eager=False, # Use CUDA graphs for faster kernel execution
    )
    
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    
    app = FastAPI()
    
    # --- API Endpoints --- #
    @app.post("/generate")
    async def generate(request: GenerationRequest):
        request_id = f"cmpl-{random_uuid()}"
    
        sampling_params = SamplingParams(
            n=1,
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens,
        )
    
        # For streaming responses
        if request.stream:
            async def stream_results():
                previous_text = ""
                async for result in engine.generate(request.prompt, sampling_params, request_id):
                    # Each result carries the cumulative text generated so far,
                    # so emit only the newly generated delta as a simple SSE event.
                    text = result.outputs[0].text
                    delta = text[len(previous_text):]
                    previous_text = text
                    yield f"data: {json.dumps({'text': delta})}\n\n"
                    if result.finished:
                        break
            return StreamingResponse(stream_results(), media_type="text/event-stream")
    
        # For non-streaming responses
        else:
            final_output = None
            async for result in engine.generate(request.prompt, sampling_params, request_id):
                if result.finished:
                    final_output = result
                    break
            
            if final_output is None:
                 return JSONResponse({"error": "Generation failed"}, status_code=500)
    
            text_outputs = [output.text for output in final_output.outputs]
            return JSONResponse({
                "request_id": request_id,
                "prompt": final_output.prompt,
                "text": text_outputs,
                "usage": {
                    "prompt_tokens": len(final_output.prompt_token_ids),
                    "completion_tokens": len(final_output.outputs[0].token_ids)
                }
            })
    
    @app.get("/health")
    async def health_check():
        return {"status": "ok"}
    

    Key Production Patterns in this Code:

  • AsyncLLMEngine: This is the core of the non-blocking architecture. The FastAPI event loop can handle thousands of incoming connections while waiting for the model to process batches.

  • gpu_memory_utilization=0.90: We explicitly tell vLLM to use up to 90% of the GPU VRAM for the model weights and its KV cache block pool. This prevents out-of-memory errors and leaves headroom for the CUDA context and other processes.

  • enforce_eager=False: This enables CUDA graphs, which can significantly speed up the execution of the model's kernels by reducing launch overhead, especially for smaller batches.

  • Streaming with StreamingResponse: The stream_results async generator yields text deltas as they are produced, allowing for real-time, interactive applications like chatbots. This is implemented using Server-Sent Events (SSE); a minimal client sketch follows this list.
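
    To round this out, here is a minimal, hypothetical client that consumes the SSE stream using the requests library. It assumes the data: {"text": ...} delta format produced by stream_results above; adapt the URL and prompt to your deployment.

    python
    # Minimal, illustrative SSE client for the streaming /generate endpoint.
    import json
    import requests

    payload = {"prompt": "Explain PagedAttention in one paragraph.", "stream": True}

    with requests.post("http://localhost:8000/generate", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue  # skip blank separator lines between SSE events
            chunk = json.loads(line[len("data: "):])
            print(chunk["text"], end="", flush=True)
    print()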

    Advanced Scenarios and Performance Tuning

    Multi-GPU Serving with Tensor Parallelism

    For models too large to fit on a single GPU (e.g., Llama-70B), vLLM leverages tensor parallelism. The model's weights are sharded across multiple GPUs, and each GPU computes on its portion of the tensors. Activating this is remarkably simple.

  • Modify AsyncEngineArgs: Change tensor_parallel_size to the number of GPUs you want to use.

    python
        engine_args = AsyncEngineArgs(
            # ... other args
            tensor_parallel_size=4, # Use 4 GPUs
        )

  • Launch with Docker: Ensure the Docker container has access to all GPUs.

    bash
        docker run --gpus all -p 8000:8000 my-vllm-server

    vLLM handles the complex synchronization and communication between GPUs (e.g., all-reduce operations) under the hood.

    Edge Case: Handling Aborted Requests

    In a real-world system, clients might disconnect. A naive server would continue generating tokens for the aborted request, wasting precious GPU cycles. vLLM's engine is designed to handle this.

    python
    @app.post("/generate")
    async def generate(request: Request, body: GenerationRequest):
        # ... (setup code)
        request_id = f"cmpl-{random_uuid()}"
    
        try:
            # ... (generation logic)
            async for result in engine.generate(body.prompt, sampling_params, request_id):
                # Check if client is still connected before processing
                if await request.is_disconnected():
                    print(f"Client disconnected, aborting request {request_id}")
                    await engine.abort(request_id)
                    break
                # ... (yield or process result)
        except asyncio.CancelledError:
            # This block is entered if the request is aborted
            print(f"Request {request_id} was cancelled and aborted.")

    By calling await engine.abort(request_id), we instruct the vLLM scheduler to immediately stop processing this sequence and free its associated memory blocks, making them available for other requests instantly.

    Performance Benchmarking: Quantifying the Gains

    To truly appreciate the impact of vLLM, we must benchmark it against a baseline, such as a simple Hugging Face pipeline wrapped in FastAPI. We can use a load testing tool like locust or a custom Python script with asyncio and aiohttp.

    Benchmark Script (benchmark.py)

    python
    import asyncio
    import aiohttp
    import time
    import random
    
    URL = "http://localhost:8000/generate"
    NUM_REQUESTS = 200
    CONCURRENCY = 50
    
    prompts = [
        "Write a short story about a robot who discovers music.",
        "Explain the theory of relativity in simple terms.",
        "What are the top 5 benefits of using Kubernetes?",
        "Translate 'Hello, how are you?' to French.",
        "Generate a python function to calculate the factorial of a number."
    ]
    
    async def send_request(session, payload):
        start_time = time.time()
        async with session.post(URL, json=payload) as response:
            if response.status == 200:
                await response.json() # Consume the response body
                return time.time() - start_time
            else:
                print(f"Request failed with status: {response.status}")
                return None
    
    async def main():
        tasks = []
        latencies = []
        wall_start = time.time()
        async with aiohttp.ClientSession() as session:
            for i in range(NUM_REQUESTS):
                payload = {
                    "prompt": random.choice(prompts),
                    "max_tokens": random.randint(100, 512),
                    "stream": False
                }
                tasks.append(send_request(session, payload))
                if len(tasks) >= CONCURRENCY or i == NUM_REQUESTS - 1:
                    results = await asyncio.gather(*tasks)
                    latencies.extend([r for r in results if r is not None])
                    tasks = []
        wall_time = time.time() - wall_start

        num_successful = len(latencies)
        avg_latency = sum(latencies) / num_successful if num_successful > 0 else 0
        throughput = num_successful / wall_time if wall_time > 0 else 0

        print("--- Benchmark Results ---")
        print(f"Total successful requests: {num_successful}/{NUM_REQUESTS}")
        print(f"Average Latency: {avg_latency:.4f} seconds")
        print(f"Throughput: {throughput:.4f} requests/second (wall-clock)")
    
    if __name__ == "__main__":
        asyncio.run(main())

    Expected Benchmark Results (Conceptual)

    Server Implementation       | Concurrency | Throughput (tokens/sec) | P99 Latency (s)  | GPU Utilization | Notes
    ----------------------------|-------------|-------------------------|------------------|-----------------|----------------------------------------------
    Naive HF Pipeline + FastAPI | 10          | ~150                    | 8.5              | ~40%            | High idle time due to static batching.
    Naive HF Pipeline + FastAPI | 50          | ~160 (choked)           | >30 (OOM errors) | ~45%            | Fails under load, frequent OOMs.
    vLLM + FastAPI              | 10          | ~1800                   | 1.2              | ~95%            | High utilization, low latency.
    vLLM + FastAPI              | 50          | ~2200                   | 3.8              | ~98%            | Scales gracefully, maintains high throughput.

    These hypothetical results illustrate the order-of-magnitude difference. The naive implementation chokes as concurrency increases because of memory fragmentation and inefficient batching. vLLM, however, scales almost linearly until the GPU is fully saturated, delivering significantly higher throughput at a lower, more predictable latency.

    Conclusion

    For senior engineers building LLM-powered applications, moving beyond simplistic deployment models is not an option—it's a requirement for building scalable, cost-effective services. The traditional approach to KV cache management is fundamentally broken for high-concurrency workloads, leading to catastrophic memory fragmentation and abysmal GPU utilization.

    vLLM's PagedAttention and continuous batching algorithms directly address these core architectural flaws. By adopting principles from operating system design, vLLM transforms GPU memory from a statically partitioned, inefficient resource into a dynamically managed, highly utilized asset. The result is up to a 10-20x increase in throughput, a dramatic reduction in latency, and the ability to serve far more concurrent users on the same hardware. Mastering these techniques is essential for anyone serious about deploying LLMs in production.
