Triton Multi-GPU Dynamic Batching for Heterogeneous Model Serving

Goh Ling Yong

The Production Dilemma: Stranded GPU Capital

In production machine learning, the cost of inference is dominated by the hardware it runs on. A server equipped with multiple high-end GPUs represents a significant capital expenditure. The cardinal sin of MLOps is leaving that capital underutilized. A naive "one model, one GPU" deployment strategy is often the fast-track to inefficiency. You might have a massive computer-vision model that can saturate a GPU, but only during traffic spikes. Alongside it, you may need to serve a small, latency-sensitive text classification model that barely tickles the GPU's compute cores but demands immediate responses.

When these heterogeneous workloads coexist, contention arises. A long-running batch inference job from the vision model can block the NLP model, causing its latency to skyrocket and violate SLOs. Simply throwing more hardware at the problem is a losing game. The goal is to make multiple models play nicely on a shared, multi-GPU infrastructure, maximizing throughput for heavy models while protecting the latency of lighter ones.

This is where the NVIDIA Triton Inference Server shines, but its default settings are merely a starting point. True optimization requires a deep understanding of its two most powerful scheduling and placement features: the Dynamic Batcher and Instance Groups. This article dissects advanced strategies for combining these features to solve the heterogeneous model serving problem on multi-GPU systems. We will move from a poorly performing baseline to a finely-tuned configuration, backed by concrete config.pbtxt files and perf_analyzer benchmarks.


Our Laboratory: Heterogeneous Models and a Multi-GPU Server

To make this tangible, let's define our scenario. We have a server with two NVIDIA A10 GPUs (GPU 0 and GPU 1). We need to deploy two distinct models:

  • yolov8_ensemble: A compute-intensive object detection model. It benefits significantly from batching. High throughput is its primary performance metric.
  • distilbert_sentiment: A latency-sensitive text classification model based on DistilBERT. It must return responses in milliseconds. P99 latency is its critical SLO.

Our goal is to configure Triton to serve both models concurrently, maximizing yolov8_ensemble's throughput without sacrificing distilbert_sentiment's low latency.
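
Before touching any model configuration, it helps to have the server itself up. Here is a minimal launch sketch, assuming the standard NGC Triton container (the image tag below is a placeholder to substitute) and the model repository shown in the next section mounted from the host:

bash
# Launch Triton from the NGC container; replace the placeholder tag with a recent release
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /model_repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models

# Confirm that both A10s are visible on the host
nvidia-smi --query-gpu=index,name,memory.total --format=csv

Ports 8000, 8001, and 8002 expose the HTTP, gRPC, and Prometheus metrics endpoints respectively; all three will be useful later.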

    Baseline Configuration: The Path to Contention

    A common starting point is to create a minimal configuration for each model and let Triton handle the rest. Let's see why this is problematic.

    Here is the model repository structure:

    bash
    /model_repository
    ├── distilbert_sentiment
    │   ├── 1
    │   │   └── model.onnx
    │   └── config.pbtxt
    └── yolov8_ensemble
        ├── 1
        │   └── model.graphdef
        └── config.pbtxt

    Baseline yolov8_ensemble/config.pbtxt:

    pbtxt
    name: "yolov8_ensemble"
    platform: "tensorflow_graphdef"
    max_batch_size: 64
    
    input [
      {
        name: "images"
        data_type: TYPE_FP32
        dims: [ -1, -1, 3 ]
      }
    ]
    
    output [
      {
        name: "output0"
        data_type: TYPE_FP32
        dims: [ 84, 8400 ]
      }
    ]
    
    # By default, Triton will create one instance on the first available GPU (GPU 0)

    Baseline distilbert_sentiment/config.pbtxt:

    pbtxt
    name: "distilbert_sentiment"
    platform: "onnxruntime_onnx"
    max_batch_size: 16
    
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [ 128 ]
      },
      {
        name: "attention_mask"
        data_type: TYPE_INT64
        dims: [ 128 ]
      }
    ]
    
    output [
      {
        name: "output"
        data_type: TYPE_FP32
        dims: [ 2 ]
      }
    ]
    
    # By default, Triton will also create one instance on the first available GPU (GPU 0)
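
    Once both configs are in place and the server is started, a quick readiness check confirms everything loaded. A small sketch against the default HTTP port 8000, assuming the launch command from earlier:

    bash
    # A 200 response means the server and each model are ready to accept requests
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/yolov8_ensemble/ready
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/distilbert_sentiment/ready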

    With this setup, Triton places one instance of each model on GPU 0. When we bombard the server with requests for both models, we see the classic contention problem. We can simulate this using Triton's perf_analyzer.

    First, let's benchmark distilbert_sentiment alone to get a latency baseline.

    bash
    # Generate dummy data for perf_analyzer
    # (Assuming appropriate shapes for inputs)
    
    perf_analyzer -m distilbert_sentiment -u localhost:8001 --concurrency-range 1:4 -i grpc

    Result (Isolated):

    text
    *** Measurement Summary ***
      Concurrency: 1, throughput: 150 infer/sec, latency: 6500 usec
      Concurrency: 4, throughput: 450 infer/sec, latency: 8800 usec

    At a concurrency of 4, we see a respectable latency of ~8.8ms. Now, let's start a high-throughput load on yolov8_ensemble and re-run the distilbert_sentiment benchmark simultaneously.

    bash
    # In one terminal, saturate yolov8_ensemble
    perf_analyzer -m yolov8_ensemble -u localhost:8001 --concurrency-range 32 -i grpc --shape images:640,640,3
    
    # In another terminal, re-run the latency test for distilbert_sentiment
    perf_analyzer -m distilbert_sentiment -u localhost:8001 --concurrency-range 1:4 -i grpc

    Result (distilbert_sentiment under contention):

    text
    *** Measurement Summary ***
      Concurrency: 1, throughput: 90 infer/sec, latency: 11000 usec
      Concurrency: 4, throughput: 200 infer/sec, latency: 19500 usec

    Our P99 latency for distilbert_sentiment has more than doubled from 8.8ms to 19.5ms. The compute-heavy YOLOv8 batches are hogging the GPU's CUDA cores, forcing the smaller, faster DistilBERT requests to wait in the queue. This is unacceptable for a latency-sensitive service.

    Advanced Pattern 1: Per-Model Dynamic Batcher Tuning

    The first lever we can pull is the dynamic_batching scheduler. It's not a one-size-fits-all setting. We must tune it specifically for the performance profile of each model.

    The dynamic batcher works by holding incoming requests for a configurable period (max_queue_delay_microseconds), hoping to accumulate a larger batch up to preferred_batch_size or max_batch_size. Amortizing per-request overhead across a larger batch and keeping the GPU's compute units busier is great for throughput, but it is a direct trade-off with latency.

    Our Strategy:

    * For yolov8_ensemble (Throughput-bound): We will allow a longer queue delay to form larger, more efficient batches.

    * For distilbert_sentiment (Latency-bound): We will use a very short (or zero) queue delay, effectively forcing the scheduler to dispatch smaller batches immediately.

    Optimized yolov8_ensemble/config.pbtxt:

    pbtxt
    name: "yolov8_ensemble"
    platform: "tensorflow_graphdef"
    max_batch_size: 64
    
    # ... inputs and outputs ...
    
    dynamic_batching {
      preferred_batch_size: [16, 32]  # Encourage scheduler to form batches of 16 or 32
      max_queue_delay_microseconds: 10000 # Wait up to 10ms to form a preferred batch
    }

    Optimized distilbert_sentiment/config.pbtxt:

    pbtxt
    name: "distilbert_sentiment"
    platform: "onnxruntime_onnx"
    max_batch_size: 16
    
    # ... inputs and outputs ...
    
    dynamic_batching {
      max_queue_delay_microseconds: 100 # Wait only 0.1ms. Almost immediate dispatch.
    }

    By setting max_queue_delay_microseconds to a very low value for distilbert_sentiment, we are telling Triton: "Do not wait to form a larger batch for this model. If a request arrives and an execution instance is free, send it for processing immediately." This prioritizes latency over batching efficiency for this specific model.
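
    A quick way to verify where distilbert_sentiment's time is actually going is Triton's Prometheus metrics endpoint on port 8002. A sketch, assuming the default metrics configuration:

    bash
    # Cumulative queue time vs. compute time for the model, in microseconds
    curl -s localhost:8002/metrics | grep -E 'nv_inference_(queue|compute_infer)_duration_us' | grep distilbert_sentiment

    # Divide by the successful request count to get per-request averages
    curl -s localhost:8002/metrics | grep nv_inference_request_success | grep distilbert_sentiment

    If queue duration dominates compute duration, the scheduler, not the GPU, is the bottleneck, and the batcher settings are the right place to keep digging.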

    After reloading Triton with these configurations (tritonserver --model-control-mode=poll ...) and re-running our contention benchmark, we see an improvement.

    Result (distilbert_sentiment with tuned batcher):

    text
    *** Measurement Summary ***
      Concurrency: 4, throughput: 280 infer/sec, latency: 14000 usec

    Latency has improved from 19.5ms to 14ms. This is better, but still significantly higher than our 8.8ms baseline. The models are still competing for the same execution resources on a single GPU. We need to physically separate their execution paths.

    Advanced Pattern 2: Strategic Placement with Instance Groups

    This is where instance_group becomes our most powerful tool. It allows us to control exactly how many execution instances of a model are created and on which devices (CPUs or specific GPUs) they are placed.

    Our server has two GPUs. Let's use them strategically.

    Strategy A: Hard Isolation

    The simplest multi-GPU strategy is to pin each model to a different GPU. yolov8_ensemble can have GPU 0, and distilbert_sentiment can have GPU 1. This provides the strongest performance isolation.

    yolov8_ensemble/config.pbtxt for GPU 0:

    pbtxt
    # ... name, platform, batching etc. ...
    
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ] # Pin this instance to GPU 0
      }
    ]

    distilbert_sentiment/config.pbtxt for GPU 1:

    pbtxt
    # ... name, platform, batching etc. ...
    
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 1 ] # Pin this instance to GPU 1
      }
    ]

    Now, let's re-run the contention benchmark.

    Result (distilbert_sentiment with GPU isolation):

    text
    *** Measurement Summary ***
      Concurrency: 4, throughput: 445 infer/sec, latency: 8950 usec

    Success! The latency is back to ~9ms, almost identical to the isolated baseline. yolov8_ensemble can run massive batches on GPU 0 without affecting distilbert_sentiment on GPU 1. However, we've introduced a new inefficiency. The distilbert_sentiment model is so lightweight that it will barely utilize GPU 1. Meanwhile, if the yolov8_ensemble gets an extreme traffic spike, its requests will queue up on GPU 0, even though GPU 1 is mostly idle. We can do better.
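
    That imbalance is easy to see while the benchmark runs, for example with a quick nvidia-smi loop on the host (an eyeball check rather than a formal measurement):

    bash
    # Sample utilization and memory once per second during the load test
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 1

    Under hard isolation you will typically see GPU 0 pinned near 100% utilization while GPU 1 sits nearly idle whenever distilbert_sentiment traffic is light.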

    Strategy B: The Hybrid Power Pattern - Replication and Co-location

    A more sophisticated approach is to replicate the heavy model across all available GPUs to maximize its throughput potential, while co-locating the latency-sensitive model on one of those GPUs, relying on our tuned dynamic batcher and CUDA's concurrent execution capabilities to protect it.

    Our Hybrid Strategy:

  • Create two instances of yolov8_ensemble, one on GPU 0 and one on GPU 1. Triton's scheduler will automatically load-balance incoming requests between them.
  • Create one instance of distilbert_sentiment and place it on one of the GPUs, say GPU 0. It will share the hardware with one of the YOLO instances.

    This pattern acknowledges that the GPU can execute work from multiple models concurrently. While a YOLO instance is busy with a heavy batch, the GPU can interleave the much shorter DistilBERT inference alongside it rather than serializing the two.

    yolov8_ensemble/config.pbtxt for Replication:

    pbtxt
    name: "yolov8_ensemble"
    # ... platform, inputs, outputs ...
    max_batch_size: 64
    dynamic_batching {
      preferred_batch_size: [16, 32]
      max_queue_delay_microseconds: 10000
    }
    
    # Create two instances, one on each GPU
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      },
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 1 ]
      }
    ]

    distilbert_sentiment/config.pbtxt for Co-location:

    pbtxt
    name: "distilbert_sentiment"
    # ... platform, inputs, outputs ...
    max_batch_size: 16
    dynamic_batching {
      max_queue_delay_microseconds: 100
    }
    
    # Place a single instance on GPU 0, alongside a YOLO instance
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

    Let's analyze the expected outcome before benchmarking. The throughput of yolov8_ensemble should roughly double, as it now has two GPUs working on its request queue. The latency of distilbert_sentiment should remain low because:

    a) Its max_queue_delay is tiny, so it doesn't wait long in Triton's scheduler.

    b) Its execution time is very short, allowing it to find a window to execute on GPU 0 between the larger compute kernels of the YOLO model.

    Benchmarking the Hybrid Pattern:

    First, yolov8_ensemble throughput (--concurrency-range 64):

    * Single GPU: ~120 infer/sec

    * Hybrid (Dual GPU): ~235 infer/sec (Almost a 2x improvement!)

    Now, the critical test for distilbert_sentiment latency under full load:

    Result (distilbert_sentiment with Hybrid Pattern):

    text
    *** Measurement Summary ***
      Concurrency: 4, throughput: 420 infer/sec, latency: 9400 usec

    This is the result we were after. We have nearly doubled the throughput of our heavy model while incurring only a negligible (~0.6ms) latency penalty on our sensitive model compared to the isolated baseline. We are now using our hardware far more effectively.
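
    Outside of perf_analyzer, the same breakdown is visible in Triton's per-model statistics endpoint, which reports cumulative request counts and queue/compute durations. A quick check over HTTP, assuming the default port 8000:

    bash
    # Per-model counts and queue/compute durations (reported in nanoseconds)
    curl -s localhost:8000/v2/models/distilbert_sentiment/stats | python3 -m json.tool
    curl -s localhost:8000/v2/models/yolov8_ensemble/stats | python3 -m json.tool

    Watching distilbert_sentiment's queue duration stay flat while yolov8_ensemble's execution counts climb is the clearest signal that the hybrid pattern is holding up.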


    Edge Cases and Production Hardening

    Achieving this state is excellent, but a production environment introduces more complexity.

    Edge Case 1: GPU Memory Constraints

    Our configurations have implicitly assumed the models fit in GPU memory. max_batch_size isn't just a performance knob; it's a direct control over memory allocation. The peak memory usage for a model instance can be roughly estimated as:

    VRAM ≈ (Model Weights Size) + (max_batch_size * Size of Intermediate Tensors per Request)

    When co-locating models as we did in the hybrid pattern, you must ensure the sum of their memory footprints on that GPU does not exceed available VRAM. If GPU 0 has 16GB VRAM, you need to ensure:

    VRAM_yolo_instance_0 + VRAM_distilbert_instance_0 < 16GB

    If you are close to the limit, you may need to reduce max_batch_size on the co-located instance or choose not to co-locate at all. Triton's logs will show CUDA OOM (Out of Memory) errors if you exceed the limit.
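
    Rather than trusting the estimate alone, check the actual footprint once both instances are loaded. Triton exposes per-GPU memory gauges on its metrics endpoint; a quick check, assuming default metric settings:

    bash
    # Per-GPU memory used and total, as reported by Triton on port 8002
    curl -s localhost:8002/metrics | grep -E 'nv_gpu_memory_(used|total)_bytes'

    Leave headroom for CUDA context and backend workspace allocations; running within a few hundred megabytes of the limit invites an OOM on the next traffic spike.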

    Edge Case 2: Heterogeneous GPUs

    What if our server has an A100 (GPU 0) and a T4 (GPU 1)? These GPUs have vastly different performance characteristics. A symmetric replication strategy (count: 1 on each) is suboptimal.

    In this scenario, you should allocate more instances or larger batch sizes to the more powerful GPU. You could, for example, place two YOLO instances on the A100 and only one on the T4.

    yolov8_ensemble/config.pbtxt for Heterogeneous GPUs:

    pbtxt
    # ...
    
    instance_group [
      {
        # Two instances on the powerful A100
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
      },
      {
        # One instance on the less powerful T4
        count: 1
        kind: KIND_GPU
        gpus: [ 1 ]
      }
    ]

    This sends roughly two-thirds of the traffic to the faster GPU, aligning load with capability.

    Edge Case 3: Dynamic Model Updates and CPU Offloading

    In a live environment, models are constantly being updated. Using Triton's Model Control API to dynamically load a new model version can cause a temporary spike in GPU memory usage. If you are already near your memory capacity, this can cause the load to fail.
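
    If you update models frequently, explicit model control puts the load and unload timing in your hands instead of the repository poller's. A minimal sketch using the repository API, reusing the /models path and model names from our example:

    bash
    # Start the server with explicit model control and load only what is needed
    tritonserver --model-repository=/models --model-control-mode=explicit \
      --load-model=yolov8_ensemble --load-model=distilbert_sentiment

    # Load the updated model after new files are placed in the repository
    curl -X POST localhost:8000/v2/repository/models/distilbert_sentiment/load

    # Unload a model to free its GPU memory before loading something larger
    curl -X POST localhost:8000/v2/repository/models/yolov8_ensemble/unload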

    A robust pattern is to designate one GPU as the primary for serving and another as a staging/secondary. Or, for rarely used or very large models, consider CPU-only instances (kind: KIND_CPU) as a fallback. This prevents a non-critical model from evicting a high-value model from GPU memory.

    Example: A utility model on CPU

    pbtxt
    name: "preprocessing_utility"
    platform: "python"
    
    # ...
    
    instance_group [
      {
        count: 4 # Can have more instances as CPU cores are plentiful
        kind: KIND_CPU
      }
    ]

    This ensures your expensive GPU resources are reserved for models that can actually leverage them.

    Conclusion: From Configuration to Architecture

    Optimizing a multi-model, multi-GPU Triton deployment is an exercise in moving from simple configuration to deliberate system architecture. It requires a deep understanding of both the workload characteristics and the server's hardware topology.

    The key takeaways for senior engineers are:

  • Never Use a Global Configuration: The dynamic_batching parameters must be tuned on a per-model basis to reflect its unique throughput/latency trade-off.
  • instance_group is Your Scalpel: Use it for precise control over placement. Start with isolation to establish baselines, then move to intelligent replication and co-location to maximize utilization.
  • Benchmark at Every Step: Intuition is not enough. Use perf_analyzer under realistic, concurrent loads to validate that your changes are having the intended effect and not causing unintended regressions.
  • Model Memory as a First-Class Citizen: Proactively calculate memory footprints for your max_batch_size and instance counts to prevent production OOM errors.

    By applying these advanced patterns, you can transform an underperforming, contentious inference server into a highly efficient, multi-tenant platform that extracts maximum value from every GPU cycle.
