Triton Dynamic Batching for High-Throughput Transformer Inference

15 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Concurrent Inference Challenge with Transformers

Transformer models have revolutionized NLP, but their computational cost presents a significant challenge in production. A single inference pass on a model like BERT, while fast in isolation, becomes a bottleneck under concurrent user traffic. The common deployment pattern of a web server queueing requests and sending them one-by-one to a GPU is fundamentally flawed. It ignores the massively parallel architecture of the GPU, which is designed to perform the same operation on large batches of data simultaneously. When you send a single sequence (batch size 1) to a GPU, you're using only a tiny fraction of its available CUDA cores, leaving the hardware mostly idle. This leads to two primary problems:

  • Low GPU Utilization: Your expensive A100 or H100 GPU sits at 5-10% utilization while your request queue grows and latency skyrockets.
  • Poor Throughput: The total number of inferences per second (IPS) is capped by the latency of a single, inefficient inference call, plus network and framework overhead.
This is a classic system design trade-off: do we optimize for the lowest possible latency for a single request, or for the highest overall system throughput? For most real-world applications, the goal is to maximize throughput so you can serve as many concurrent users as possible within an acceptable latency Service Level Objective (SLO). This is precisely where NVIDIA Triton's dynamic batching scheduler becomes indispensable.

    Baseline: The Naive Deployment

    Let's establish a baseline to see the problem firsthand. Imagine we're deploying a distilled BERT model, exported to ONNX format. Our initial Triton model repository might look like this:

    text
    /models
    └── distilbert
        ├── 1
        │   └── model.onnx
        └── config.pbtxt

    And the config.pbtxt is minimal, with batching explicitly disabled:

    protobuf
    name: "distilbert"
    platform: "onnxruntime_onnx"
    max_batch_size: 0 # Disables batching entirely
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [ -1, -1 ]
      },
      {
        name: "attention_mask"
        data_type: TYPE_INT64
        dims: [ -1, -1 ]
      }
    ]
    output [
      {
        name: "last_hidden_state"
        data_type: TYPE_FP32
        dims: [ -1, -1, 768 ]
      }
    ]

    Setting max_batch_size: 0 explicitly tells Triton not to attempt any form of batching. Every request that arrives at the server is immediately sent to the ONNX Runtime backend for execution. If 100 requests arrive simultaneously, they will be processed sequentially by the model instance, leading to a massive queue and unacceptable tail latencies.

    Benchmarking this setup with Triton's perf_analyzer tool would reveal the problem clearly:

    bash
    # Simulate 100 concurrent users sending requests
    perf_analyzer -m distilbert -u localhost:8001 -i grpc --concurrency-range 100

    The output would show high average latency and a throughput figure far below the GPU's theoretical capacity. This is our starting point for optimization.

    Deep Dive into Triton's Dynamic Batcher

    The dynamic batcher is a scheduler implemented within Triton itself. It sits between the incoming request queue and the model execution backend. Its job is to intelligently collect individual inference requests over a short period, group them into a single, larger batch, and then submit that batch to the model. After the model processes the batch, the scheduler de-batches the results and returns the appropriate response to each original request.

    This process introduces a small amount of intentional delay for some requests, but the payoff is a massive increase in overall throughput because the GPU is now processing data in a way that aligns with its architecture.
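
    A back-of-envelope calculation makes the payoff concrete. The numbers below are purely illustrative assumptions rather than measurements from any particular GPU: suppose a batch-of-1 forward pass takes 8 ms while a batch of 32 takes 20 ms.

    python
    # Illustrative arithmetic only: both latencies are assumptions, not benchmarks.
    BATCH1_LATENCY_S = 0.008    # assumed time for a batch-of-1 forward pass
    BATCH32_LATENCY_S = 0.020   # assumed time for a batch-of-32 forward pass

    sequential_ips = 1 / BATCH1_LATENCY_S    # one request served per pass
    batched_ips = 32 / BATCH32_LATENCY_S     # 32 requests amortize a single pass

    print(f"sequential: {sequential_ips:.0f} IPS")
    print(f"batched:    {batched_ips:.0f} IPS ({batched_ips / sequential_ips:.1f}x)")

    Even with generous assumptions for the unbatched case, amortizing a single forward pass over dozens of sequences is where the headline throughput gains come from.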

    Configuring the Dynamic Batcher: `config.pbtxt`

    The magic happens within the dynamic_batching block of your config.pbtxt. Let's create an optimized configuration.

    protobuf
    name: "distilbert"
    platform: "onnxruntime_onnx"
    max_batch_size: 256 # IMPORTANT: Set a VRAM-appropriate max batch size
    
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [ -1 ] # Note: Dims now reflect a single item in the batch
      },
      {
        name: "attention_mask"
        data_type: TYPE_INT64
        dims: [ -1 ]
      }
    ]
    output [
      {
        name: "last_hidden_state"
        data_type: TYPE_FP32
        dims: [ -1, 768 ]
      }
    ]
    
    dynamic_batching {
      preferred_batch_size: [8, 16, 32] 
      max_queue_delay_microseconds: 5000 # The most critical tuning parameter
      default_queue_policy {
        timeout_action: REJECT
        default_timeout_microseconds: 10000000 # 10 seconds
        allow_timeout_override: false
      }
    }

    Let's break down these parameters, as their interaction is subtle and critical for performance.

    * max_batch_size: 256: This is the absolute maximum number of requests the scheduler can combine. This is not a performance target; it's a memory safeguard. You must determine this value by profiling your model to find the largest batch that fits comfortably within your GPU's VRAM. Exceeding this will lead to out-of-memory (OOM) errors. For a distilbert model with a sequence length of 128 on a 16GB GPU, 256 is a reasonable starting point. A probing sketch for finding this limit follows this parameter breakdown.

    * preferred_batch_size: [8, 16, 32]: This is the performance tuning parameter. It tells the scheduler: "If you have accumulated a batch of this size, send it for inference immediately, even if the delay window hasn't closed." You should populate this array with batch sizes that show high performance on your specific GPU. For many Transformer architectures on NVIDIA Ampere or Hopper GPUs, batch sizes that are powers of 2 (especially multiples of 8 or 16) often yield the best performance due to Tensor Core alignment. You can find these optimal sizes by running perf_analyzer with different fixed batch sizes; a sweep sketch also follows this parameter breakdown.

    * max_queue_delay_microseconds: 5000: This is the heart of the latency/throughput trade-off. It defines the maximum time (in microseconds) that Triton will wait to form a preferred batch. When the first request arrives for a new batch, a timer starts. Triton will continue to accumulate requests until one of the following occurs:

    1. A preferred_batch_size is reached. The batch is sent immediately.

    2. max_batch_size is reached. The batch is sent immediately.

    3. The max_queue_delay_microseconds timer expires. The current batch, whatever its size, is sent.

    A smaller delay (e.g., 1000 µs) prioritizes lower latency but may result in smaller, less efficient batches under light load. A larger delay (e.g., 10000 µs) maximizes the chance of forming large, efficient batches, boosting throughput at the cost of increased average latency.

    * default_queue_policy: This block handles requests that wait too long. timeout_action: REJECT will cause Triton to return an error for requests that exceed the default_timeout_microseconds, preventing them from languishing in the queue indefinitely. This is a crucial backpressure mechanism to maintain service stability under extreme load.
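
    As mentioned for max_batch_size above, the safe ceiling comes from profiling. The sketch below is one illustrative way to probe it: it loads the exported ONNX model directly with onnxruntime-gpu (an assumption; any way of running the model on the target GPU works) and doubles the batch size until allocation fails. The path and sequence length follow the examples above.

    python
    import numpy as np
    import onnxruntime as ort

    MODEL_PATH = "models/distilbert/1/model.onnx"  # assumed path from the repo layout above
    SEQ_LEN = 128

    session = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

    batch, largest_ok = 1, 0
    while batch <= 1024:
        try:
            feeds = {
                "input_ids": np.ones((batch, SEQ_LEN), dtype=np.int64),
                "attention_mask": np.ones((batch, SEQ_LEN), dtype=np.int64),
            }
            session.run(None, feeds)
            largest_ok = batch
            batch *= 2  # double until allocation fails or we hit the cap
        except Exception as exc:  # CUDA OOM surfaces here as a runtime error
            print(f"Batch {batch} failed: {exc}")
            break

    print(f"Largest successful batch: {largest_ok}; set max_batch_size 20-30% below this.")

    For preferred_batch_size, a quick sweep over fixed batch sizes with perf_analyzer is usually enough. The sketch below simply shells out to perf_analyzer (assumed to be on PATH, for example inside the Triton SDK container) using its -b flag; compare the throughput it reports for each size and keep the winners.

    python
    import subprocess

    # Sweep fixed batch sizes against a running server and compare the
    # throughput perf_analyzer reports for each one.
    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        print(f"\n=== batch size {batch_size} ===", flush=True)
        subprocess.run(
            [
                "perf_analyzer",
                "-m", "distilbert",
                "-u", "localhost:8001",
                "-i", "grpc",
                "-b", str(batch_size),       # fixed batch size per request
                "--concurrency-range", "4",  # keep a few requests in flight
            ],
            check=False,
        )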

    Production Implementation and Benchmarking

    With our new config.pbtxt, let's structure our deployment and create a robust benchmarking client.

    Model Repository Structure

    Your model repository should be well-defined:

    text
    /models
    └── distilbert
        ├── 1
        │   └── model.onnx
        └── config.pbtxt  # Our new, optimized config

    You would launch Triton pointing to this repository:

    bash
    docker run --gpus all -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models
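
    Once the container is up, it is worth confirming that the server is live and the model loaded cleanly before pointing any traffic at it. A minimal sketch using the synchronous gRPC client (model name and port match the setup above):

    python
    import tritonclient.grpc as grpcclient

    # Probe liveness and readiness before running any benchmark.
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    assert client.is_server_live(), "Triton is not live"
    assert client.is_server_ready(), "Triton is not ready"
    assert client.is_model_ready("distilbert"), "distilbert failed to load; check config.pbtxt"

    # Print the parsed configuration to confirm dynamic_batching was accepted.
    print(client.get_model_config("distilbert"))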

    Advanced Benchmarking Client with `asyncio`

    While perf_analyzer is excellent, a custom client can help simulate more complex, real-world traffic patterns. Here is a Python client using asyncio and tritonclient to bombard the server with concurrent requests.

    python
    import asyncio
    import numpy as np
    import tritonclient.grpc.aio as grpcclient
    from transformers import AutoTokenizer
    import time
    import logging
    
    # --- Configuration ---
    MODEL_NAME = "distilbert"
    URL = "localhost:8001"
    NUM_CONCURRENT_REQUESTS = 200
    REQUESTS_PER_COROUTINE = 50
    SEQUENCE_LENGTH = 128
    
    # --- Logging Setup ---
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    
    # --- Prepare Input Data ---
    # In a real scenario, this would be dynamic user input.
    # Here, we pre-generate it for consistent benchmarking.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    sample_text = "Triton's dynamic batching is a key feature for production throughput." * 5 # Make it long enough
    
    inputs = tokenizer(sample_text, return_tensors="np", max_length=SEQUENCE_LENGTH, padding="max_length", truncation=True)
    
    input_ids = inputs['input_ids'].astype(np.int64)
    attention_mask = inputs['attention_mask'].astype(np.int64)
    
    async def send_request(triton_client, request_id):
        """Sends a single inference request and returns its latency."""
        infer_input_ids = grpcclient.InferInput("input_ids", input_ids.shape, "INT64")
        infer_input_ids.set_data_from_numpy(input_ids)
    
        infer_input_mask = grpcclient.InferInput("attention_mask", attention_mask.shape, "INT64")
        infer_input_mask.set_data_from_numpy(attention_mask)
    
        start_time = time.perf_counter()
        try:
            await triton_client.infer(
                model_name=MODEL_NAME,
                inputs=[infer_input_ids, infer_input_mask],
                request_id=str(request_id)
            )
            end_time = time.perf_counter()
            return end_time - start_time
        except Exception as e:
            logging.error(f"Request {request_id} failed: {e}")
            return None
    
    async def worker(worker_id, triton_client, latencies):
        """A worker coroutine that sends a batch of requests."""
        logging.info(f"Worker {worker_id} starting.")
        for i in range(REQUESTS_PER_COROUTINE):
            request_id = f"{worker_id}-{i}"
            latency = await send_request(triton_client, request_id)
            if latency is not None:
                latencies.append(latency)
        logging.info(f"Worker {worker_id} finished.")
    
    async def main():
        """Main function to orchestrate the load test."""
        latencies = []
        async with grpcclient.InferenceServerClient(url=URL) as triton_client:
            start_total_time = time.perf_counter()
    
            tasks = [
                worker(i, triton_client, latencies) 
                for i in range(NUM_CONCURRENT_REQUESTS)
            ]
            await asyncio.gather(*tasks)
    
            end_total_time = time.perf_counter()
    
        total_time = end_total_time - start_total_time
        total_requests = len(latencies)
        throughput = total_requests / total_time
    
        if not latencies:
            logging.error("No successful requests were made.")
            return
    
        p95 = np.percentile(latencies, 95) * 1000
        p99 = np.percentile(latencies, 99) * 1000
        avg_latency = np.mean(latencies) * 1000
    
        logging.info("--- Benchmark Results ---")
        logging.info(f"Total Requests: {total_requests}")
        logging.info(f"Total Time: {total_time:.2f}s")
        logging.info(f"Throughput: {throughput:.2f} IPS")
        logging.info(f"Average Latency: {avg_latency:.2f} ms")
        logging.info(f"P95 Latency: {p95:.2f} ms")
        logging.info(f"P99 Latency: {p99:.2f} ms")
    
    if __name__ == "__main__":
        asyncio.run(main())
    

    Analyzing Benchmark Results

    Running this benchmark against both the naive (max_batch_size: 0) and the dynamically batched configurations would yield dramatically different results.

    Expected Results (Illustrative):

    | Configuration | Throughput (IPS) | Avg Latency (ms) | P99 Latency (ms) | GPU Utilization |
    |---|---|---|---|---|
    | Naive (max_batch_size: 0) | 85 | 1176 | 3500 | 12% |
    | Dynamic Batching (delay: 5ms) | 750 | 133 | 250 | 85% |

    Analysis:

    * Throughput: The dynamic batching configuration shows an almost 9x increase in throughput. This is the primary win. We are serving significantly more users with the same hardware.

    * Latency: While the minimum latency of a single request in the naive setup might be lower (if it's the only one in the system), the average and tail latencies under load are disastrous. The dynamic batching setup provides a much more stable and predictable latency profile for all users, even though it intentionally adds a small delay. The P99 latency is an order of magnitude better, which is critical for a good user experience.

    * GPU Utilization: The nvidia-smi command during the benchmark would confirm the story: the GPU is working hard in the batched setup, as it should be.
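
    A throwaway polling loop like the one below is handy for watching utilization and memory while the benchmark runs; it just shells out to nvidia-smi's standard CSV query interface once per second.

    python
    import subprocess
    import time

    # Print GPU utilization and memory once a second (Ctrl+C to stop).
    while True:
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
        ).stdout.strip()
        print(f"{time.strftime('%H:%M:%S')}  util [%], memory [MiB]: {out}")
        time.sleep(1)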

    This demonstrates that for any service expecting more than a handful of concurrent requests, dynamic batching is not just an optimization—it's a requirement.

    Advanced Patterns and Edge Cases

    Mastering dynamic batching requires understanding how to handle more complex, real-world scenarios.

    Edge Case 1: Handling Variable Sequence Lengths

    Our benchmark assumed a fixed sequence length (128). In reality, user input varies. Sending a batch containing sequences of lengths [20, 110, 55] to a Transformer model requires all sequences to be padded to the length of the longest sequence in the batch (110). This means for the sequence of length 20, you are performing 90 tokens' worth of useless computation, wasting GPU cycles.

    Solution: Ragged Batching

    Triton, in conjunction with backends like ONNX Runtime or TensorRT, can handle this with a feature sometimes called "ragged batching." Instead of the client performing the padding, you send the raw, variable-length tensors. The model must be specifically constructed to handle this, but Triton facilitates it.

    Your config.pbtxt would be modified to signal that an input can be part of a ragged batch.

    protobuf
    # ... (other config) ...
    input [
      {
        name: "input_ids"
        # ... (other properties)
        allow_ragged_batch: true
      },
      {
        name: "attention_mask"
        # ... (other properties)
        allow_ragged_batch: true
      }
    ]
    # ...

    This is an advanced technique that requires changes to both the model export process and the client-side request preparation, but it can yield significant performance gains (1.5-2x) for workloads with highly variable input sizes by eliminating wasteful padding computations.

    Edge Case 2: Prioritizing High-Value Requests

    Imagine a scenario where some inference requests are from premium users and others are from free-tier users. You want to ensure premium requests are processed with lower latency, even under heavy load from free users.

    Solution: Priority Levels

    The dynamic batcher has a built-in priority queue. You can configure multiple priority levels in the config.pbtxt and then tag incoming requests with a desired priority.

    protobuf
    # ...
    dynamic_batching {
      preferred_batch_size: [8, 16, 32] 
      max_queue_delay_microseconds: 5000
      priority_levels: 3 # e.g., High, Medium, Low
      default_priority_level: 3 # Lowest priority
      priority_queue_policy {
        key: 1 # Highest priority
        value: {
          # High priority users get a smaller queue and timeout faster
          max_queue_size: 100
          timeout_action: REJECT
          default_timeout_microseconds: 2000000 # 2 seconds
        }
      }
      priority_queue_policy {
        key: 3 # Lowest priority
        value: {
          # Free users get a larger queue and longer timeout
          max_queue_size: 1000
          timeout_action: REJECT
          default_timeout_microseconds: 20000000 # 20 seconds
        }
      }
    }

    When a client sends a request, it can specify its priority:

    python
    # In your Python client
    await triton_client.infer(
        model_name=MODEL_NAME,
        inputs=[...],
        priority=1 # This is a premium user request
    )

    Triton's scheduler will always attempt to form batches from the highest available priority level first, ensuring your premium users' requests spend less time waiting in the queue.

    Edge Case 3: Interplay with Multiple Model Instances

    To further increase concurrency, you can configure Triton to load multiple instances of your model on the same GPU. This is done with the instance_group setting.

    protobuf
    # At the top level of config.pbtxt
    instance_group [
      {
        count: 2 # Load two instances of this model
        kind: KIND_GPU
      }
    ]
    # ... dynamic_batching config follows

    How does this interact with dynamic batching? The dynamic batcher is a per-model scheduler: requests enter a single queue for the model, and when a model instance becomes available the scheduler forms a batch from the queued requests and dispatches it to that instance. With two instances, two batches can therefore be in flight at once and potentially execute in parallel (if the GPU architecture and workload allow for it, via mechanisms like CUDA streams).

    This is a powerful combination: instance_group increases the potential for parallel execution, while dynamic_batching ensures each of those executions is highly efficient.
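
    A quick sanity check is to read back the configuration Triton actually loaded, over the HTTP endpoint (port 8000 in the launch command above), and confirm that both blocks are present. A minimal sketch:

    python
    import tritonclient.http as httpclient

    # Fetch the parsed model configuration and inspect both settings.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    config = client.get_model_config("distilbert")

    print("instance_group:", config.get("instance_group"))
    print("dynamic_batching:", config.get("dynamic_batching"))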

    Conclusion: A Production Checklist

    Dynamic batching is a non-negotiable component of any production-grade Transformer inference service built on Triton. It is the primary mechanism for translating raw GPU FLOPs into real-world throughput.

    Here is a checklist for senior engineers implementing this pattern:

  • Profile for max_batch_size: Before anything else, determine the largest batch size your model can handle within your GPU's VRAM. This is a hard limit. Use a simple script to incrementally increase batch size until you hit an OOM error, then back off by 20-30% for a safety margin.
  • Find preferred_batch_sizes: Use perf_analyzer or a custom script to test the throughput of various fixed batch sizes (e.g., 2, 4, 8, 16, 32, 64...). Identify the sizes that provide the best performance on your hardware and add them to the preferred_batch_size array.
  • Tune max_queue_delay_microseconds Iteratively: This is the most sensitive parameter. Start with a value that aligns with your P99 latency SLO (e.g., if your SLO is 200ms, don't set the delay to 100ms). Start low (e.g., 2000-5000 µs) and measure the throughput/latency trade-off under a realistic load test. Incrementally increase the delay and observe the impact. The optimal value is one that maximizes throughput while keeping P99 latency just within your SLO.
  • Implement Backpressure: Always configure a default_queue_policy with a reasonable timeout and REJECT action. An infinitely growing queue will always lead to system failure. Rejecting requests when overloaded is a more graceful failure mode.
  • Monitor in Production: Your work isn't done at deployment. Monitor key metrics: GPU utilization, inference count, and, most importantly, the average and max queue duration for your model (exposed via Triton's Prometheus metrics endpoint). If your queue duration is consistently hitting max_queue_delay_microseconds, your service is likely at capacity. If it's always near zero, you might be able to increase the delay to form better batches and increase throughput further. A metrics-scraping sketch follows this checklist.
    By moving from a naive deployment to a finely tuned dynamic batching configuration, you can ensure your inference services are robust, scalable, and cost-effective, fully leveraging the powerful hardware at your disposal.
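
    For the monitoring point above, Triton's Prometheus endpoint (port 8002 in the launch command) exposes per-model counters such as nv_inference_request_success and nv_inference_queue_duration_us. The sketch below scrapes that endpoint with the requests package (an added dependency) and derives the average queue time per successful request; the metric names are those exposed by recent Triton releases and may differ on yours.

    python
    import requests

    METRICS_URL = "http://localhost:8002/metrics"  # Triton's default metrics port
    MODEL = "distilbert"

    def metric_total(body: str, name: str, model: str) -> float:
        """Sum every series of a Prometheus metric that belongs to `model`."""
        total = 0.0
        for line in body.splitlines():
            if line.startswith(name) and f'model="{model}"' in line:
                total += float(line.rsplit(" ", 1)[1])
        return total

    body = requests.get(METRICS_URL, timeout=5).text
    queue_us = metric_total(body, "nv_inference_queue_duration_us", MODEL)
    successes = metric_total(body, "nv_inference_request_success", MODEL)

    if successes:
        print(f"Average queue time: {queue_us / successes / 1000:.2f} ms "
              f"over {successes:.0f} successful requests")
    else:
        print("No successful inference requests recorded yet.")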
