Maximizing LLM Throughput with vLLM's PagedAttention Mechanism
The Bottleneck in Autoregressive LLM Serving: The KV Cache Problem
For engineers deploying Large Language Models (LLMs) in production, the initial excitement of model.generate() quickly gives way to the harsh realities of performance engineering. The primary culprit for poor throughput and high latency under concurrent loads isn't the model's forward pass computation itself, but rather the inefficient management of its state—specifically, the Key-Value (KV) cache.
In autoregressive models like GPT, Llama, or Mistral, each generated token depends on all preceding tokens. To avoid re-computing the entire sequence for every new token, we cache the key and value tensors from the attention layers for all previous tokens. This is the KV cache. While it dramatically speeds up individual generation, it introduces a severe memory management challenge on the GPU.
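To get a feel for the scale of the problem, it is worth working the numbers for a concrete model. The back-of-the-envelope sketch below assumes Llama-2-7B-like hyperparameters (32 layers, 32 KV heads, head dimension 128, fp16); the exact figures differ per model, but the order of magnitude is what matters.
# Rough KV cache sizing (illustrative hyperparameters, not read from a checkpoint)
num_layers = 32       # transformer blocks
num_kv_heads = 32     # Llama-2-7B uses full multi-head attention
head_dim = 128        # hidden_size 4096 / 32 attention heads
bytes_per_elem = 2    # fp16
# One key vector and one value vector per head, per layer, per token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")               # ~512 KiB
# A single 2048-token sequence therefore ties up about 1 GiB of VRAM
print(f"Per 2048-token sequence: {kv_bytes_per_token * 2048 / 2**30:.2f} GiB")
So every concurrently served sequence can easily claim a gigabyte of VRAM for its cache alone, before any model weights are counted.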
Why Naive KV Cache Management Fails at Scale
Traditional serving stacks pre-allocate one contiguous region of GPU memory per request, sized for the worst case (the configured max_sequence_length). If a request only generates 100 tokens but max_sequence_length is 2048, over 95% of that reserved memory is wasted; this is internal fragmentation. With dozens of concurrent requests, GPU memory becomes a checkerboard of used and unused (but reserved) regions, and no remaining contiguous gap is large enough to admit a new request even when the total free memory would suffice; this is external fragmentation. It's the GPU equivalent of malloc hell.
Let's visualize the fragmentation problem. Imagine a GPU with 16GB of memory available for the KV cache and a max_sequence_length of 2048. Each request pre-allocates enough space for this full length.
GPU Memory (Naive Allocation):
|-- Request 1 (uses 10%, 90% wasted) --|-- Request 2 (uses 50%, 50% wasted) --|-- Request 3 (uses 5%, 95% wasted) --| ... | Free Memory (but too small for a new request) |
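Plugging in the ~512 KiB/token estimate from above makes the waste concrete. The short sketch below quantifies how much of one reservation a short completion actually uses; the numbers simply mirror the scenario described above.
# How much of one max_sequence_length reservation a short request actually uses
kv_bytes_per_token = 512 * 1024   # ~512 KiB/token, from the earlier estimate
max_sequence_length = 2048
tokens_actually_stored = 100      # the short request from the example above
reserved_bytes = max_sequence_length * kv_bytes_per_token
used_bytes = tokens_actually_stored * kv_bytes_per_token
print(f"Reserved: {reserved_bytes / 2**30:.2f} GiB, used: {used_bytes / 2**30:.2f} GiB")
print(f"Wasted: {100 * (1 - tokens_actually_stored / max_sequence_length):.1f}%")  # ~95.1%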
This is the precise problem that vLLM's PagedAttention was designed to solve.
PagedAttention: Virtual Memory for the GPU
PagedAttention borrows a decades-old, battle-tested concept from operating systems: virtual memory and paging. Instead of allocating a single, large, contiguous block of memory for each sequence's KV cache, it allocates memory in smaller, fixed-size chunks called blocks.
Each sequence keeps a block table that maps its logical blocks (the sequence's tokens, chopped into block-sized chunks) to physical blocks that may live anywhere in GPU memory. When the model needs to attend to previous tokens, the PagedAttention kernel walks this block table to gather the required key and value vectors from their scattered physical locations. Because blocks are small and allocated on demand, internal waste is capped at one partially filled block per sequence, and external fragmentation disappears entirely.
The Architectural Advantage
# Conceptual representation of PagedAttention
class BlockTableManager:
def __init__(self, total_blocks, block_size):
self.free_blocks = list(range(total_blocks))
self.block_size = block_size
def allocate_block(self):
if not self.free_blocks:
raise MemoryError("Out of GPU blocks")
return self.free_blocks.pop(0)
def free_block(self, block_id):
self.free_blocks.append(block_id)
class Sequence:
def __init__(self, prompt_tokens, manager):
self.tokens = list(prompt_tokens)
self.block_table = []
self.manager = manager
        # Allocate enough blocks up front to hold the prompt
        self._ensure_capacity(len(self.tokens))
    def _ensure_capacity(self, total_tokens):
        # Ceiling division: blocks needed for total_tokens, minus blocks we already hold
        required_blocks = -(-total_tokens // self.manager.block_size) - len(self.block_table)
        for _ in range(required_blocks):
            self.block_table.append(self.manager.allocate_block())
def append_token(self, token_id):
if len(self.tokens) % self.manager.block_size == 0:
self.block_table.append(self.manager.allocate_block())
self.tokens.append(token_id)
# With PagedAttention, this operation is cheap
def fork_sequence(parent_sequence: Sequence) -> Sequence:
child = Sequence([], parent_sequence.manager)
child.tokens = parent_sequence.tokens[:]
# Just copy the list of block IDs, not the massive tensors inside
child.block_table = parent_sequence.block_table[:]
return child
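A quick walkthrough of the toy classes above shows why forking is effectively free (the block_size and token counts here are arbitrary):
# Toy walkthrough of the classes above
manager = BlockTableManager(total_blocks=1024, block_size=16)
parent = Sequence(prompt_tokens=range(40), manager=manager)  # 40 tokens -> 3 blocks
print(parent.block_table)        # [0, 1, 2]
child = fork_sequence(parent)    # O(1): copies block IDs, not KV tensors
print(child.block_table)         # the same physical blocks: [0, 1, 2]
parent.append_token(101)         # token 41 still fits in block 2, no new allocation
child.append_token(202)          # real vLLM would copy-on-write the shared last block here
This is exactly what makes parallel sampling and beam search cheap in vLLM: candidate continuations share the prompt's physical blocks and only diverge (via per-block reference counts and copy-on-write) where their tokens differ.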
This architectural shift enables the second key innovation in vLLM: continuous batching.
From Static to Continuous Batching
With memory management solved, vLLM can replace static batching, where an entire batch must run to completion before any new request is admitted, with a far more dynamic scheduling algorithm. On every engine iteration:
1. The scheduler checks for newly arrived requests in a waiting queue.
2. If there is enough free block memory, it adds new requests to the running batch.
3. It runs a single forward pass for the entire batch, generating one new token for each sequence.
4. It checks if any sequences in the batch have finished (reached an EOS token or max_tokens).
5. Finished sequences are removed from the batch, and their memory blocks are immediately freed and returned to the pool.
6. The process repeats, potentially adding new requests in the very next iteration.
This means the GPU is constantly processing tokens, never waiting for a whole batch to complete. A short request can start and finish while a long-running request is still in the middle of its generation, maximizing GPU utilization and dramatically increasing throughput.
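The loop below is a minimal sketch of that schedule, not vLLM's actual scheduler: the engine and scheduler objects and their methods (forward_step, has_free_blocks, and so on) are hypothetical stand-ins, and real vLLM additionally handles preemption, swapping, and copy-on-write forks.
# Minimal sketch of a continuous-batching loop (hypothetical interfaces)
def serving_loop(engine, scheduler):
    running = []  # sequences currently being decoded
    while True:
        # Steps 1-2: admit waiting requests while free KV blocks remain
        while scheduler.has_waiting() and scheduler.has_free_blocks():
            running.append(scheduler.pop_waiting())
        if not running:
            continue
        # Step 3: one forward pass yields one new token per running sequence
        new_tokens = engine.forward_step(running)
        for seq, token in zip(running, new_tokens):
            seq.append_token(token)
        # Steps 4-5: retire finished sequences and free their blocks immediately
        still_running = []
        for seq in running:
            if seq.is_finished():
                scheduler.free_blocks(seq)
            else:
                still_running.append(seq)
        running = still_running
        # Step 6: loop again; new arrivals can join on the very next iteration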
Production Implementation: High-Throughput Async Server
Let's build a production-grade inference server using vLLM's asynchronous engine with FastAPI. This pattern is ideal for handling high-concurrency workloads.
Step 1: Dockerizing the Environment
First, we need a Dockerfile that sets up the CUDA environment and installs PyTorch and vLLM. Using an official NVIDIA CUDA base image is crucial so the CUDA toolkit inside the container matches what vLLM's kernels expect.
# Use an official NVIDIA CUDA base image
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
# Install Python and other essentials
RUN apt-get update && apt-get install -y \
python3.10 python3.10-venv python3-pip git \
&& rm -rf /var/lib/apt/lists/*
# Set up a virtual environment
RUN python3.10 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install PyTorch and vLLM
# Ensure torch version matches CUDA version for optimal performance
RUN pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
RUN pip install vllm==0.3.2 fastapi uvicorn sse-starlette pydantic
WORKDIR /app
COPY ./src /app
# Expose the port our API will run on
EXPOSE 8000
# Command to run the service
# Run a single uvicorn worker: each additional worker would load its own copy of the
# model and duplicate GPU memory, so scale out with more GPUs or replicas instead
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Step 2: The Asynchronous FastAPI Server
We'll use AsyncLLMEngine and AsyncEngineArgs to configure and run the model. The engine drives batching and generation in a background loop and exposes an async generator per request, so our API awaits results without ever blocking the event loop on long generation tasks.
src/main.py:
import asyncio
import json
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from pydantic import BaseModel
from typing import Optional, List
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid
# --- Pydantic Models for API --- #
class GenerationRequest(BaseModel):
prompt: str
max_tokens: int = 256
temperature: float = 0.7
top_p: float = 0.95
stream: bool = False
# --- Engine Configuration --- #
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
# These settings are critical for performance.
# Adjust them based on your GPU memory and expected workload.
engine_args = AsyncEngineArgs(
model=MODEL_ID,
tokenizer=MODEL_ID,
tensor_parallel_size=1, # Set to number of GPUs for multi-GPU inference
dtype="auto",
seed=0,
gpu_memory_utilization=0.90, # Use 90% of GPU memory
    max_num_batched_tokens=4096,  # Token budget per engine step; fine-tune based on VRAM
    max_model_len=4096,  # Cap the context length so prompts fit within the token budget above
enforce_eager=False, # Use CUDA graphs for faster kernel execution
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
app = FastAPI()
# --- API Endpoints --- #
@app.post("/generate")
async def generate(request: GenerationRequest):
request_id = f"cmpl-{random_uuid()}"
sampling_params = SamplingParams(
n=1,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
)
# For streaming responses
if request.stream:
async def stream_results():
async for result in engine.generate(request.prompt, sampling_params, request_id):
                # Each result is a RequestOutput; output.text is the cumulative
                # generation so far, and the final result has finished=True
                text_outputs = [output.text for output in result.outputs]
                # Simple SSE format; json.dumps handles quoting and escaping
                yield f"data: {json.dumps({'text': text_outputs[0]})}\n\n"
if result.finished:
break
return StreamingResponse(stream_results(), media_type="text/event-stream")
# For non-streaming responses
else:
final_output = None
async for result in engine.generate(request.prompt, sampling_params, request_id):
if result.finished:
final_output = result
break
if final_output is None:
return JSONResponse({"error": "Generation failed"}, status_code=500)
text_outputs = [output.text for output in final_output.outputs]
return JSONResponse({
"request_id": request_id,
"prompt": final_output.prompt,
"text": text_outputs,
"usage": {
"prompt_tokens": len(final_output.prompt_token_ids),
"completion_tokens": len(final_output.outputs[0].token_ids)
}
})
@app.get("/health")
async def health_check():
return {"status": "ok"}
Key Production Patterns in this Code:
* AsyncLLMEngine: This is the core of the non-blocking architecture. The FastAPI event loop can handle thousands of incoming connections while waiting for the model to process batches.
* gpu_memory_utilization=0.90: We allow vLLM to use up to 90% of the GPU VRAM in total; whatever remains after the model weights and activation workspace becomes the KV block pool. Keeping the cap below 100% avoids out-of-memory errors and leaves a small buffer for the driver and other processes.
* enforce_eager=False: This enables CUDA graphs, which can significantly speed up the execution of the model's kernels by reducing launch overhead, especially for smaller batches.
* Streaming with StreamingResponse: The stream_results async generator yields tokens as they are produced, allowing for real-time, interactive applications like chatbots. This is implemented using Server-Sent Events (SSE).
Advanced Scenarios and Performance Tuning
Multi-GPU Serving with Tensor Parallelism
For models too large to fit on a single GPU (e.g., Llama-70B), vLLM leverages tensor parallelism. The model's weights are sharded across multiple GPUs, and each GPU computes on its portion of the tensors. Activating this is remarkably simple.
In AsyncEngineArgs, change tensor_parallel_size to the number of GPUs you want to use:
engine_args = AsyncEngineArgs(
    # ... other args
    tensor_parallel_size=4,  # Use 4 GPUs
)
Then make sure all GPUs are visible to the container when launching it:
docker run --gpus all -p 8000:8000 my-vllm-server
vLLM handles the complex synchronization and communication between GPUs (e.g., all-reduce operations) under the hood.
Edge Case: Handling Aborted Requests
In a real-world system, clients might disconnect. A naive server would continue generating tokens for the aborted request, wasting precious GPU cycles. vLLM's engine is designed to handle this.
@app.post("/generate")
async def generate(request: Request, body: GenerationRequest):
# ... (setup code)
request_id = f"cmpl-{random_uuid()}"
try:
# ... (generation logic)
async for result in engine.generate(body.prompt, sampling_params, request_id):
# Check if client is still connected before processing
if await request.is_disconnected():
print(f"Client disconnected, aborting request {request_id}")
await engine.abort(request_id)
break
# ... (yield or process result)
    except asyncio.CancelledError:
        # Entered if the task serving this request is cancelled (e.g. the client
        # dropped mid-await); make sure the engine-side request is freed as well
        await engine.abort(request_id)
        print(f"Request {request_id} was cancelled and aborted.")
        raise
By calling await engine.abort(request_id), we instruct the vLLM scheduler to immediately stop processing this sequence and free its associated memory blocks, making them available for other requests instantly.
Performance Benchmarking: Quantifying the Gains
To truly appreciate the impact of vLLM, we must benchmark it against a baseline, such as a simple Hugging Face pipeline wrapped in FastAPI. We can use a load testing tool like locust or a custom Python script with asyncio and aiohttp.
Benchmark Script (benchmark.py)
import asyncio
import aiohttp
import time
import random
URL = "http://localhost:8000/generate"
NUM_REQUESTS = 200
CONCURRENCY = 50
prompts = [
"Write a short story about a robot who discovers music.",
"Explain the theory of relativity in simple terms.",
"What are the top 5 benefits of using Kubernetes?",
"Translate 'Hello, how are you?' to French.",
"Generate a python function to calculate the factorial of a number."
]
async def send_request(session, payload):
start_time = time.time()
async with session.post(URL, json=payload) as response:
if response.status == 200:
await response.json() # Consume the response body
return time.time() - start_time
else:
print(f"Request failed with status: {response.status}")
return None
async def main():
    tasks = []
    latencies = []
    wall_start = time.time()
    async with aiohttp.ClientSession() as session:
        for i in range(NUM_REQUESTS):
            payload = {
                "prompt": random.choice(prompts),
                "max_tokens": random.randint(100, 512),
                "stream": False
            }
            tasks.append(send_request(session, payload))
            # Fire requests in waves of CONCURRENCY
            if len(tasks) >= CONCURRENCY or i == NUM_REQUESTS - 1:
                results = await asyncio.gather(*tasks)
                latencies.extend([r for r in results if r is not None])
                tasks = []
    wall_time = time.time() - wall_start
    num_successful = len(latencies)
    # Throughput must be measured against wall-clock time, not the sum of latencies
    throughput = num_successful / wall_time if wall_time > 0 else 0
    avg_latency = sum(latencies) / num_successful if num_successful > 0 else 0
    print("--- Benchmark Results ---")
    print(f"Total successful requests: {num_successful}/{NUM_REQUESTS}")
    print(f"Average Latency: {avg_latency:.4f} seconds")
    print(f"Throughput: {throughput:.4f} requests/second")
if __name__ == "__main__":
asyncio.run(main())
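To compare implementations, point URL at the baseline server and then at the vLLM server, and run the script with identical settings each time:
python benchmark.py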
Expected Benchmark Results (Conceptual)
| Server Implementation | Concurrency | Throughput (tokens/sec) | P99 Latency (s) | GPU Utilization | Notes |
|---|---|---|---|---|---|
| Naive HF Pipeline + FastAPI | 10 | ~150 | 8.5 | ~40% | High idle time due to static batching. |
| Naive HF Pipeline + FastAPI | 50 | ~160 (choked) | >30 (OOM errors) | ~45% | Fails under load, frequent OOMs. |
| vLLM + FastAPI | 10 | ~1800 | 1.2 | ~95% | High utilization, low latency. |
| vLLM + FastAPI | 50 | ~2200 | 3.8 | ~98% | Scales gracefully, maintains high throughput. |
These hypothetical results illustrate the order-of-magnitude difference. The naive implementation chokes as concurrency increases because of memory fragmentation and inefficient batching. vLLM, however, scales almost linearly until the GPU is fully saturated, delivering significantly higher throughput at a lower, more predictable latency.
Conclusion
For senior engineers building LLM-powered applications, moving beyond simplistic deployment models is not an option—it's a requirement for building scalable, cost-effective services. The traditional approach to KV cache management is fundamentally broken for high-concurrency workloads, leading to catastrophic memory fragmentation and abysmal GPU utilization.
vLLM's PagedAttention and continuous batching algorithms directly address these core architectural flaws. By adopting principles from operating system design, vLLM transforms GPU memory from a statically partitioned, inefficient resource into a dynamically managed, highly utilized asset. The result is a roughly 10-20x increase in throughput, a dramatic reduction in tail latency, and the ability to serve far more concurrent requests and longer contexts on the same hardware. Mastering these techniques is essential for anyone serious about deploying LLMs in production.