Optimizing Triton for Multi-Model GPU Sharing via Dynamic Batching

Goh Ling Yong

The High Cost of Underutilized GPUs in Production ML

In any production machine learning system, the cost of inference is dominated by the underlying hardware, typically high-end GPUs like NVIDIA's A100 or H100 series. A common, yet financially inefficient, deployment pattern is the one-to-one mapping: one model per GPU. While simple to manage, this architecture often leads to abysmal resource utilization. Consider a scenario where a real-time recommendation model receives an average of 50 queries per second (QPS). Deploying this model on a dedicated A100 GPU, capable of handling thousands of inferences per second, results in GPU utilization hovering in the low single digits. The organization pays for 100% of a powerful accelerator but leverages only a fraction of its capacity.

This inefficiency is compounded in systems with a diverse portfolio of models. You might have a high-traffic, low-latency computer vision model alongside a low-traffic, batch-oriented NLP model for document analysis. Dedicating a GPU to each is operationally untenable. The core challenge for senior MLOps and ML Systems engineers is to maximize the throughput and, by extension, the return on investment (ROI) of this expensive hardware.

The solution lies in intelligent model co-location and dynamic request batching. NVIDIA's Triton Inference Server is engineered specifically for this purpose, providing sophisticated mechanisms to run multiple models concurrently on a single GPU and dynamically group incoming requests into optimal batches. This article is a deep dive into the advanced configuration and optimization patterns required to achieve this, moving far beyond the "hello world" examples to address production-grade challenges.


Triton's Architecture for Concurrent Model Execution

To effectively co-locate models, we must first understand the key architectural components in Triton that enable concurrency and resource management. We'll bypass the basics of model repositories and focus on the two pillars of our optimization strategy: Instance Groups and the Dynamic Batching Scheduler.

Instance Groups (`instance_group`)

At its core, an "instance" of a model in Triton is a loaded copy of the model's computational graph ready to execute inferences on a specific processing unit (CPU or GPU). The instance_group configuration in a model's config.pbtxt file is the primary directive for controlling where and how many instances of a model are created.

For our goal of multi-model co-location, the critical parameters are kind, count, and gpus.

  • kind: Specifies the type of processor. KIND_GPU directs Triton to load the model instance onto a GPU, while KIND_CPU places it on the CPU.
  • count: Defines how many instances of this model to load for this group.
  • gpus: An array of GPU device IDs where these instances should be placed. For example, gpus: [0, 1] would instruct Triton to create count instances on GPU 0 and another count instances on GPU 1.
Example: A basic config.pbtxt for a single model on GPU 0:

    protobuf
    name: "resnet50_vision"
    platform: "onnxruntime_onnx"
    max_batch_size: 64
    
    input [
      {
        name: "input_1"
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
      }
    ]
    
    output [
      {
        name: "fc1000"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]
    
    # Direct Triton to load one instance of this model onto GPU with device ID 0.
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

    By creating config.pbtxt files for multiple different models and pointing them all to gpus: [0], we instruct Triton to load all of them into the VRAM of a single GPU. Triton's scheduler will then manage the execution of inference requests for these different models on the shared hardware, forming the basis of our optimization.


    Implementing Multi-Model Co-location on a Single GPU

    Let's architect a concrete, production-style scenario. We need to deploy two distinct models on a single NVIDIA T4 GPU:

  • resnet50_vision: A computer vision model for image classification, latency-sensitive.
  • bert_nlp: A BERT-base model for sentiment analysis, less latency-sensitive but computationally heavy.

    Our goal is to have both models actively serving traffic from the same GPU, sharing its computational resources.

    Step 1: Model Repository Structure

    First, we establish the standard Triton model repository layout. This structure is non-negotiable for Triton to discover and load the models correctly.

    bash
    /models
    ├── resnet50_vision
    │   ├── 1
    │   │   └── model.onnx
    │   └── config.pbtxt
    └── bert_nlp
        ├── 1
        │   └── model.onnx
        └── config.pbtxt

    Step 2: Configuring for Co-location

    The magic happens within the config.pbtxt files. We will explicitly tell Triton to load one instance of each model onto GPU 0.

    models/resnet50_vision/config.pbtxt

    protobuf
    name: "resnet50_vision"
    platform: "onnxruntime_onnx"
    max_batch_size: 64
    
    input [
      {
        name: "input_1"
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
      }
    ]
    
    output [
      {
        name: "fc1000"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]
    
    # Pin this model to GPU 0
    instance_group [
      {
        count: 1,
        kind: KIND_GPU,
        gpus: [ 0 ]
      }
    ]

    models/bert_nlp/config.pbtxt

    protobuf
    name: "bert_nlp"
    platform: "onnxruntime_onnx"
    max_batch_size: 32
    
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [ -1 ]  # -1 indicates a variable sequence length; the batch dimension is implied by max_batch_size
      },
      {
        name: "attention_mask"
        data_type: TYPE_INT64
        dims: [ -1 ]
      }
    ]
    
    output [
      {
        name: "last_hidden_state"
        data_type: TYPE_FP32
        dims: [ -1, 768 ]
      }
    ]
    
    # Pin this model to the SAME GPU 0
    instance_group [
      {
        count: 1,
        kind: KIND_GPU,
        gpus: [ 0 ]
      }
    ]

    When we launch Triton pointing to this model repository, it will parse these configurations and load both resnet50_vision and bert_nlp into the VRAM of the GPU with device ID 0. We can verify this using nvidia-smi.

    bash
    # Launch Triton Server
    docker run --gpus all -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models
    
    # In another terminal, check GPU memory
    nvidia-smi

    You will see a tritonserver process holding GPU memory, and the total memory consumed will be the sum of the Triton context overhead plus the VRAM footprint of both models.
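
    Beyond nvidia-smi, Triton's HTTP endpoint can confirm that both models were actually loaded and are ready to serve. A quick check against the standard endpoints exposed on port 8000 (model names are the ones from our repository):

    bash
    # Server-wide readiness (HTTP 200 when the server can serve requests)
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
    
    # Per-model readiness (HTTP 200 means the model is loaded and servable)
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/resnet50_vision/ready
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/bert_nlp/ready
    
    # Repository index: every model Triton discovered, with its load state
    curl -s -X POST localhost:8000/v2/repository/index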

    VRAM Management and OOM Pitfalls

    This co-location strategy is powerful but introduces the primary risk of Out Of Memory (OOM) errors. A senior engineer must proactively manage VRAM allocation. Before co-locating models, you must estimate their memory footprint. A practical approach is to load each model individually into Triton and record its memory usage via nvidia-smi.
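
    A minimal sketch of that measurement loop, assuming Triton was started with --model-control-mode=explicit so each model can be loaded and unloaded on its own (the sleep interval is arbitrary):

    bash
    # Assumes: tritonserver --model-repository=/models --model-control-mode=explicit
    for MODEL in resnet50_vision bert_nlp; do
      BEFORE=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
      # Ask Triton to load just this model
      curl -s -X POST localhost:8000/v2/repository/models/${MODEL}/load
      sleep 5  # give the backend time to finish its allocations
      AFTER=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
      echo "${MODEL}: ~$((AFTER - BEFORE)) MiB"
      # Unload before measuring the next model
      curl -s -X POST localhost:8000/v2/repository/models/${MODEL}/unload
      sleep 5
    done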

    Example VRAM Calculation:

  • Triton context + CUDA overhead: ~1.5 GB
  • resnet50_vision (FP32): ~102 MB
  • bert_nlp (FP32): ~438 MB
  • Intermediate activation memory (scales with max_batch_size and input size): varies, can be significant
  • Total estimated VRAM: > 2 GB before activations

    If the sum of your models' memory requirements exceeds the available VRAM, Triton will fail to load the second model, logging a CUDA OOM error. The mitigation strategy is either to use a GPU with more VRAM or, more commonly, to reduce precision or quantize the models (e.g., convert to FP16 or quantize to INT8) to shrink their memory footprint, a critical step in production ML deployment.
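
    As one illustration of that path, the ONNX model can be rebuilt as an FP16 TensorRT engine with trtexec (shipped in NVIDIA's TensorRT containers). The shape ranges below are illustrative for the ResNet model and assume the batch dimension was exported as dynamic; the resulting model.plan would then be served via Triton's tensorrt platform rather than onnxruntime_onnx, so the platform field and model file name in config.pbtxt change accordingly.

    bash
    # Build an FP16 TensorRT engine from the ONNX model (roughly halves weight memory).
    # Adjust min/opt/max shapes to match your actual traffic and max_batch_size.
    trtexec --onnx=models/resnet50_vision/1/model.onnx \
            --fp16 \
            --minShapes=input_1:1x3x224x224 \
            --optShapes=input_1:16x3x224x224 \
            --maxShapes=input_1:64x3x224x224 \
            --saveEngine=model.plan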


    The Power of Dynamic Batching: From Theory to `config.pbtxt`

    Simply co-locating models is only half the battle. Without batching, the GPU will still be underutilized, processing single inference requests sequentially. This is where the Dynamic Batching Scheduler becomes our most critical tool for maximizing throughput.

    The scheduler's job is to intercept incoming, individual inference requests and intelligently group them into a larger batch before sending them to the model instance on the GPU. This is fundamentally a latency vs. throughput trade-off: we intentionally delay individual requests for a short period, hoping that more requests will arrive within that window to form a larger, more efficient batch.

    Let's enhance our bert_nlp model configuration to use dynamic batching.

    models/bert_nlp/config.pbtxt (with Dynamic Batching)

    protobuf
    name: "bert_nlp"
    # ... (inputs/outputs are the same)
    
    instance_group [
      {
        count: 1,
        kind: KIND_GPU,
        gpus: [ 0 ]
      }
    ]
    
    # Add the dynamic_batching stanza
    dynamic_batching {
      preferred_batch_size: [8, 16]
      max_queue_delay_microseconds: 5000 # 5 milliseconds
    }

    Deep Dive into Dynamic Batching Parameters

  • max_batch_size (top-level): This is a hard limit. The backend and model graph must be able to handle a batch of this size. It's a memory and compute constraint.
  • preferred_batch_size: This is the most important tuning parameter for performance. It provides hints to the scheduler. When forming a batch, the scheduler will prioritize creating batches of these sizes. Why an array? GPUs often have optimal performance at specific batch sizes (e.g., powers of 2 that align with CUDA core configurations). The scheduler will try to create a batch of size 16. If it can't fill a batch of 16 within the delay window, it will try for a batch of 8. If that also fails, it will create a smaller, non-preferred batch.
  • max_queue_delay_microseconds: This is the latency-throughput knob. It defines the maximum time an individual request will wait in the queue to be batched with others. A higher value (e.g., 100000 for 100ms) allows the scheduler to wait longer, increasing the probability of forming a large, preferred-size batch, thus maximizing throughput. However, this adds directly to the p99 latency. For a real-time API with a 50ms SLO, you might set this to 2000 (2ms). For an offline batch processing job, you could set it to 500000 (500ms) to ensure maximum batching efficiency.
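
    After restarting the server, it is worth confirming that the scheduler actually picked up these settings. Triton can return the normalized configuration it derived for each model, which is a quick way to catch typos in config.pbtxt (jq is optional and used here only for readability):

    bash
    # Inspect the configuration Triton is actually using for bert_nlp; look for the
    # dynamic_batching section and any defaults Triton filled in.
    curl -s localhost:8000/v2/models/bert_nlp/config | jq .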


    Performance Benchmarking and Analysis

    Theory is insufficient. We must rigorously benchmark our configuration to prove its efficacy. Triton's Performance Analyzer (perf_analyzer) is the standard tool for this.

    We will simulate concurrent client traffic to both models and measure the system's performance with and without dynamic batching.

    The Benchmarking Setup

    We'll use perf_analyzer to send a constant stream of requests to both models simultaneously.

    Terminal 1: Send traffic to resnet50_vision

    bash
    # --concurrency-range 16 means 16 concurrent clients sending requests;
    # -i grpc is needed because port 8001 is Triton's gRPC endpoint
    perf_analyzer -m resnet50_vision -i grpc -u localhost:8001 --concurrency-range 16

    Terminal 2: Send traffic to bert_nlp

    bash
    perf_analyzer -m bert_nlp -i grpc -u localhost:8001 --concurrency-range 16 --input-data /path/to/bert_input_data.json

    Terminal 3: Monitor GPU Utilization

    bash
    # dmon prints scrolling device stats every second
    nvidia-smi dmon -s u -d 1
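
    For the latency comparison to be meaningful, it also helps to report a fixed percentile and to sweep concurrency rather than benchmark a single point; both are built into perf_analyzer. A sketch of a more thorough run (the flag values are illustrative):

    bash
    # Sweep client concurrency from 1 to 32 in steps of 4, report p99 latency instead
    # of the default average, and use 10-second measurement windows for stability.
    perf_analyzer -m resnet50_vision -i grpc -u localhost:8001 \
        --concurrency-range 1:32:4 \
        --percentile=99 \
        --measurement-interval 10000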

    Scenario 1: Co-location WITHOUT Dynamic Batching

    First, we run the benchmark with the initial config.pbtxt files (no dynamic_batching stanza).

    Expected Results:

  • resnet50_vision Throughput: ~350 inferences/sec
  • bert_nlp Throughput: ~120 inferences/sec
  • GPU Utilization: Fluctuating, but likely averaging around 40-50%.
  • Latency: Low, as requests are processed immediately upon arrival.

    Scenario 2: Co-location WITH Dynamic Batching

    Now, we add the dynamic_batching configuration to both models' config.pbtxt files. We will set a max_queue_delay_microseconds of 10000 (10ms) to allow batches to form.

    resnet50_vision/config.pbtxt (additions):

    protobuf
    dynamic_batching {
      preferred_batch_size: [16, 32]
      max_queue_delay_microseconds: 10000
    }

    bert_nlp/config.pbtxt (additions):

    protobuf
    dynamic_batching {
      preferred_batch_size: [8, 16]
      max_queue_delay_microseconds: 10000
    }

    After restarting Triton to apply the new configuration, we re-run the exact same perf_analyzer commands.

    Expected Results:

    Metric                     | Without Batching | With Dynamic Batching | Improvement
    ---------------------------|------------------|-----------------------|------------
    resnet50_vision (inf/sec)  | ~350             | ~950                  | +171%
    bert_nlp (inf/sec)         | ~120             | ~280                  | +133%
    Avg. GPU Utilization       | ~45%             | ~85-95%               | +100%
    p99 Latency (ms)           | ~5               | ~15                   | (Trade-off)

    Analysis:

    The results are stark. By allowing Triton to wait a mere 10ms, we enabled the formation of larger, more efficient batches. This dramatically increased the total number of inferences processed per second across both models, pushing GPU utilization towards its practical maximum. The total system throughput more than doubled. The cost was a predictable and controlled increase in p99 latency, which remains well within the SLO for many applications.
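
    Triton's Prometheus metrics endpoint (port 8002 in the launch command above) offers a production-friendly way to confirm the same effect without a load generator; in particular, the ratio of cumulative inferences to model executions approximates the average batch size actually being formed:

    bash
    # nv_inference_count counts individual inferences, nv_inference_exec_count counts
    # model executions (batches); a high count-to-exec ratio means batching is working.
    curl -s localhost:8002/metrics | grep -E "nv_inference_count|nv_inference_exec_count|nv_gpu_utilization"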


    Advanced Edge Cases and Production Patterns

    Real-world systems present complexities beyond simple benchmarks. Here's how to handle them.

    Edge Case 1: Variable-Sized Inputs (Ragged Batches)

    Our bert_nlp model declares dims: [ -1 ] for its inputs, indicating a variable sequence length. When requests for this model arrive, they will carry tensors of different shapes (e.g., one request with sequence length 20, another with 50), which produces a "ragged batch."

    Solution: By default, Triton's dynamic batcher only groups requests whose input shapes match exactly, so two requests with different sequence lengths will simply not be batched together; each executes on its own and throughput falls back toward the unbatched case. There are two common remedies. The first is to pad on the client side (or in a preprocessing model within an ensemble) to a small set of fixed lengths, which restores batching at the cost of wasted computation on padding tokens. The second is to set allow_ragged_batch: true on the variable-shaped inputs, which lets the batcher concatenate requests of different shapes, but this requires a backend that can interpret the ragged input, typically with the help of Triton's batch_input configuration to pass per-request shapes or element counts into the model.

    For stateful workloads, Triton also provides the Sequence Batching Scheduler. While a full implementation is beyond this article's scope, it involves defining a sequence_batching stanza and using control tensors to signal the start and end of a sequence, so that every request carrying the same correlation ID is routed, in order, to the same model instance, a pattern crucial for stateful models like LSTMs and for conversational AI.
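
    Whichever approach you take, measure the padding cost rather than guessing. A sketch using perf_analyzer's --shape flag to benchmark bert_nlp at fixed padded lengths (the lengths are illustrative; the dimensions exclude the batch dimension, and --input-data zero keeps the generated token IDs within the vocabulary):

    bash
    # Compare throughput and latency at two padded sequence lengths.
    for SEQ_LEN in 64 128; do
      perf_analyzer -m bert_nlp -i grpc -u localhost:8001 \
          --input-data zero \
          --concurrency-range 16 \
          --shape input_ids:${SEQ_LEN} \
          --shape attention_mask:${SEQ_LEN}
    done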

    Edge Case 2: Model Prioritization

    What if the resnet50_vision model serves a critical, user-facing API, while bert_nlp is for a lower-priority internal analytics task? If both models are under heavy load, we don't want the BERT model's heavy computations to starve the ResNet model and increase its latency.

    Solution: Triton's model and request priority settings.

    Prioritization in Triton is configured per model in config.pbtxt, not in a separate server-side configuration file. Two mechanisms are relevant. First, a model's optimization policy can declare a model-level priority (optimization { priority: PRIORITY_MAX }), asking Triton to give that model's instances scheduling and execution preference over lower-priority models; note that the model configuration documentation describes this model-level priority as currently honored mainly by the TensorRT backend, so verify that it has an effect for your backend. Second, the dynamic_batching stanza supports priority_levels and default_priority_level, which let clients attach a priority to individual requests so that higher-priority requests are dequeued ahead of lower-priority ones within that model's queue.

    1. Mark the user-facing model as high priority and the analytics model as low priority:

    protobuf
    # models/resnet50_vision/config.pbtxt (addition)
    optimization {
      priority: PRIORITY_MAX
    }
    
    # models/bert_nlp/config.pbtxt (addition)
    optimization {
      priority: PRIORITY_MIN
    }

    2. Launch Triton as usual; no separate server configuration file is required:

    bash
    tritonserver --model-repository=/models

    With this configuration, when the scheduler has a choice between work for ResNet and work for BERT on the shared GPU, it will prefer the ResNet batches, helping the critical service hold its latency SLOs under contention. Treat this as a scheduling preference rather than a hard guarantee, and validate the behavior under load with perf_analyzer.


    Conclusion: A Systems-Level Approach to Inference Optimization

    Maximizing the efficiency of production ML inference is not a model-level problem; it is a systems engineering challenge. By moving away from the simplistic one-model-per-GPU pattern and embracing the advanced features of a sophisticated serving platform like Triton, we can achieve dramatic improvements in hardware utilization and cost-effectiveness.

    Key Takeaways for Senior Engineers:

  • Co-locate Intelligently: Use instance_group to place multiple models on a single GPU, but always pre-calculate VRAM footprints to prevent OOM errors.
  • Master Dynamic Batching: Dynamic batching is your primary tool for throughput. The trade-off between max_queue_delay_microseconds and preferred_batch_size must be tuned based on your specific traffic patterns and latency SLOs.
  • Benchmark Everything: Never assume a configuration is optimal. Use perf_analyzer to rigorously test your changes under simulated production load and validate performance gains.
  • Plan for Heterogeneity: Real-world systems involve models with different priorities and input structures. Leverage advanced features like model prioritization and be aware of the implications of ragged batching to build a robust and fair scheduling environment.

    By applying these production-grade patterns, you can transform your ML inference platform from a costly, underutilized resource into a highly efficient, high-throughput system capable of scaling with your organization's needs.
