Optimizing Triton for Multi-Model GPU Sharing with Dynamic Batching
The Core Problem: Stranded Assets in Production ML Serving
In any mature ML production environment, the cost of inference hardware, particularly high-end GPUs, represents a significant portion of operational expenditure. The naive approach of dedicating one GPU per deployed model is financially untenable and technically inefficient. The common scenario involves a fleet of diverse models—perhaps a BERT-based model for NLP tasks, a ResNet for image classification, and a custom GNN for fraud detection—all needing to be served concurrently.
The primary challenge is that these models have heterogeneous resource profiles. A Transformer model might be memory-intensive and benefit from larger batches, while a lightweight CNN used for a real-time feature might be latency-sensitive and require immediate execution. Co-locating them on a single GPU without a sophisticated management layer results in a chaotic free-for-all, with every model contending for the same GPU and nothing to protect latency-sensitive traffic.
This is precisely the problem domain where NVIDIA's Triton Inference Server excels. Assuming a working knowledge of Triton's basic architecture, we will bypass introductory concepts and dive directly into the advanced configuration mechanics required to solve this multi-tenant hardware sharing problem efficiently and deterministically.
Level 1: Concurrency Within a Single Model via `instance_group`
Before we can effectively share a GPU between different models, we must first ensure we can saturate it with requests for a single model. This is the role of the instance_group configuration in config.pbtxt. Triton can load multiple instances of the same model into GPU memory, allowing them to process requests in parallel on different CUDA streams.
Consider a standard ResNet50 model. A naive configuration might omit the instance_group block entirely, in which case Triton defaults to a single instance on each available GPU.
# models/resnet50_v1_onnx/config.pbtxt (Sub-optimal)
name: "resnet50_v1_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "dense_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
Under heavy load, a single instance becomes a bottleneck. Even with batching, the GPU's Streaming Multiprocessors (SMs) may be underutilized as they wait for a single model execution to complete. By specifying multiple instances, we instruct Triton to parallelize execution.
# models/resnet50_v1_onnx/config.pbtxt (Optimized for Concurrency)
name: "resnet50_v1_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
# ... input/output definitions ...
# Key addition for parallelism
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [0] # Explicitly assign to GPU 0
  }
]
Implementation Details & Edge Cases:
*   How many instances? The ideal count is not arbitrary. It's a function of the model's size, its computational graph, and the GPU architecture. A good starting point is 2-4 instances for compute-bound models. The goal is to have enough concurrent executions to keep the GPU's SMs fed. You can determine the upper bound by observing GPU memory usage; loading too many instances will lead to OOM errors.
*   Memory Overhead: Each instance consumes its own slice of GPU memory for model weights and intermediate activation tensors. With count: 4, roughly 4 × model_size must fit comfortably on the GPU, leaving room for other models.
*   KIND_GPU vs. KIND_CPU: While KIND_CPU is an option, it's typically used for pre/post-processing logic or for models that don't benefit from GPU acceleration. For multi-model GPU sharing, we are exclusively concerned with KIND_GPU.
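The memory bound described in the bullets above is simple arithmetic. The following is a minimal sketch; the function name `max_instances` and all figures (24 GiB of GPU memory, 10 GiB reserved, 3 GiB per instance) are hypothetical placeholders, and the per-instance footprint should be measured for your own model at its peak batch size.

```python
def max_instances(gpu_mem_gib: float, reserved_gib: float, per_instance_gib: float) -> int:
    """Upper bound on the instance count that fits in the remaining GPU memory."""
    if per_instance_gib <= 0:
        raise ValueError("per_instance_gib must be positive")
    usable = gpu_mem_gib - reserved_gib
    if usable <= 0:
        return 0
    return int(usable // per_instance_gib)


# Hypothetical figures: a 24 GiB GPU, 10 GiB reserved for other models and
# runtime overhead, 3 GiB per instance (weights plus activations).
print(max_instances(24.0, 10.0, 3.0))  # 14 // 3 -> 4
```

This gives only the memory ceiling. The count that actually maximizes throughput is usually lower and has to be found by load testing.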
Configuring instance_group is the foundational step. It allows a single, high-traffic model to scale vertically on the GPU. Now, let's add the dimension of request batching.
Level 2: Dynamic Batching for Throughput Amplification
Executing requests one by one, even in parallel instances, is inefficient. GPUs are designed for massively parallel computation, and their performance shines when operating on large batches of data. Triton's dynamic batcher is the scheduler component that collects individual inference requests over a short time window and groups them into a single batch for execution.
This is where the critical trade-off between latency and throughput is managed.
# models/resnet50_v1_onnx/config.pbtxt (With Dynamic Batching)
name: "resnet50_v1_onnx"
# ... platform, max_batch_size, inputs/outputs ...
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 10000 # 10ms
}
Let's dissect this dynamic_batching block. This is where senior engineers fine-tune performance.
*   max_queue_delay_microseconds: This is the most critical parameter. It defines the maximum time Triton will wait after receiving the first request before dispatching a batch. 
    *   A low value (e.g., 1000 for 1ms) prioritizes latency. Triton will form smaller batches more quickly, which is ideal for real-time applications but yields lower overall throughput.
    *   A high value (e.g., 20000 for 20ms) prioritizes throughput. Triton waits longer, accumulating more requests to form larger, more GPU-efficient batches. This increases P99 latency but dramatically boosts the total requests per second (RPS) the server can handle.
*   preferred_batch_size: This is a powerful and often misunderstood setting. Instead of a single value, you provide an array of ideal batch sizes, and the scheduler tries to dispatch a batch of the largest preferred size it can form from the queued requests. If 17 requests are queued, Triton forms a batch of 16 and leaves 1 request in the queue to start the next batching window. If no preferred size can be formed, the batcher keeps waiting until the oldest request has been queued for max_queue_delay_microseconds, and then sends whatever it has, even if that size is not preferred. Preferred sizes are therefore a target rather than a guarantee: they steer the batcher away from awkwardly sized batches (e.g., 17) that might not align well with the GPU's core count or memory layout, a phenomenon known as the "tail effect."
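The interaction between the two parameters can be made concrete with a small decision function. This is a simplified model of the behavior described in Triton's dynamic batcher documentation, not Triton's actual implementation; the name `batch_to_dispatch` and its arguments are illustrative, and it ignores details such as instance availability and shape matching.

```python
from typing import Sequence


def batch_to_dispatch(queued: int, oldest_wait_us: int,
                      preferred: Sequence[int], max_delay_us: int,
                      max_batch_size: int) -> int:
    """Return the batch size to dispatch now, or 0 to keep waiting."""
    if queued <= 0:
        return 0
    # Largest preferred size that the queue can fill right now.
    limit = min(queued, max_batch_size)
    fitting = [p for p in preferred if p <= limit]
    if fitting:
        return max(fitting)
    # No preferred size is reachable: keep waiting unless the oldest
    # request has already waited for the maximum queue delay.
    if oldest_wait_us >= max_delay_us:
        return limit
    return 0


preferred = [8, 16, 32]
print(batch_to_dispatch(17, 2000, preferred, 10000, 128))   # 16, one request stays queued
print(batch_to_dispatch(5, 2000, preferred, 10000, 128))    # 0, keep waiting
print(batch_to_dispatch(5, 10000, preferred, 10000, 128))   # 5, delay expired
```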
Performance Scenario: Tuning max_queue_delay
Imagine running a load test against our ResNet50 model using Triton's perf_analyzer.
*   max_queue_delay_microseconds: 500 (0.5ms)
    *   Result: P99 Latency: 15ms, Throughput: 450 RPS, GPU Utilization: 45-55% (spiky).
    *   Analysis: The batcher is too impatient. It sends small, inefficient batches to the GPU, which cannot stay saturated.
*   max_queue_delay_microseconds: 10000 (10ms)
    *   Result: P99 Latency: 40ms, Throughput: 1200 RPS, GPU Utilization: 90-95% (stable).
    *   Analysis: We've accepted higher latency for a roughly 2.7x improvement in throughput. The GPU is now the bottleneck, not the scheduler, which is the desired state for a throughput-oriented task.
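The trade-off in the two runs above can be quantified directly from the reported numbers. A minimal sketch follows; the helper name `compare_runs` is illustrative, and the inputs are the scenario figures quoted above rather than new measurements.

```python
def compare_runs(base_rps: float, base_p99_ms: float,
                 tuned_rps: float, tuned_p99_ms: float) -> dict:
    """Ratio of throughput and P99 latency between two load-test runs."""
    return {
        "throughput_gain": tuned_rps / base_rps,
        "p99_ratio": tuned_p99_ms / base_p99_ms,
    }


result = compare_runs(base_rps=450, base_p99_ms=15, tuned_rps=1200, tuned_p99_ms=40)
print(f"throughput x{result['throughput_gain']:.2f}")  # throughput x2.67
print(f"p99 latency x{result['p99_ratio']:.2f}")       # p99 latency x2.67
```

In this scenario throughput and P99 latency grow by the same factor, which is acceptable for an offline workload but would be a poor trade for a model with a tight latency SLA.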
This tuning process is fundamental. Now, let's apply these principles to the multi-model challenge.
Level 3: Orchestrating Heterogeneous Models with Priorities and Rate Limiters
Here is our production scenario: Serve a latency-sensitive BERT model for sentiment analysis and the throughput-oriented ResNet50 model on the same NVIDIA A10G GPU.
* BERT Model: Needs P99 latency < 50ms. Traffic is bursty.
* ResNet50 Model: Used for offline image tagging. Throughput is key; latency is less critical.
If we deploy both with default settings, a large influx of ResNet requests could form a massive batch that occupies the GPU for 100ms, causing any concurrent BERT requests to wait, violating their SLA.
We need to tell Triton's scheduler about our business priorities. This is done with one server-level setting and one model-level configuration.
Step 1: Enable the Rate Limiter at the Server Level
By default, Triton schedules each model independently and executes a batch as soon as one of that model's instances is free. Arbitration between models is the job of the rate limiter, which is off by default. You must explicitly enable it when launching the Triton server. This is done with a command-line flag.
# Launching Triton with the rate limiter enabled
tritonserver --model-repository=/models --model-control-mode=explicit --load-model=bert_sentiment --load-model=resnet50_v1_onnx --rate-limit=execution_count
Step 2: Assign Priorities in config.pbtxt
Now, we assign a priority to each model in the rate_limiter block of its instance_group. Lower numbers indicate higher priority. The priority_levels and default_priority_level fields of dynamic_batching are a separate mechanism: they order requests within a single model's queue and do not arbitrate between models.
models/bert_sentiment/config.pbtxt
name: "bert_sentiment"
platform: "onnxruntime_onnx"
max_batch_size: 64
# ... inputs/outputs ...
instance_group [
  {
    count: 2
    kind: KIND_GPU
    # HIGHEST PRIORITY
    rate_limiter { priority: 1 }
  }
]
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 4000 # 4ms - prioritize low latency
}
models/resnet50_v1_onnx/config.pbtxt
name: "resnet50_v1_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
# ... inputs/outputs ...
instance_group [
  {
    count: 2
    kind: KIND_GPU
    # LOWER PRIORITY
    rate_limiter { priority: 2 }
  }
]
dynamic_batching {
  preferred_batch_size: [16, 32, 64]
  max_queue_delay_microseconds: 15000 # 15ms - prioritize throughput
}
With this configuration, when instances of both models are ready to run, the rate limiter favors BERT. Priority acts as a weighting: an instance with priority 2 receives half as many scheduling chances as an instance with priority 1. This biases access to the GPU toward our latency-sensitive model, even if the lower-priority model has a full batch ready to go. A batch that is already executing is not interrupted; priority only influences which instance is scheduled next.
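Triton's rate limiter documentation describes priority as a weighting value, where an instance with priority 2 gets half the scheduling chances of an instance with priority 1. The sketch below illustrates what that ratio means when both models always have work queued. It uses a generic smooth weighted round-robin, chosen here for illustration only; it is not Triton's internal algorithm, and the function name `weighted_schedule` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def weighted_schedule(priorities: dict, picks: int) -> list:
    """Pick models in proportion to 1/priority (smooth weighted round-robin)."""
    weights = {name: Fraction(1, p) for name, p in priorities.items()}
    total = sum(weights.values())
    current = {name: Fraction(0) for name in priorities}
    order = []
    for _ in range(picks):
        # Every candidate accumulates credit proportional to its weight.
        for name in current:
            current[name] += weights[name]
        # The candidate with the most credit runs and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order


order = weighted_schedule({"bert_sentiment": 1, "resnet50_v1_onnx": 2}, 300)
print(Counter(order))  # bert_sentiment: 200, resnet50_v1_onnx: 100
```

The lower-priority model still receives a third of the scheduling chances, so weighting shifts the balance without shutting the other model out.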
Edge Case: Starvation and the Need for Rate Limiting
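Priorities address contention between the models, but they raise the question of starvation. Under a strict priority scheme, a high-priority BERT model with a constant, high-volume stream of traffic could monopolize the GPU, and the lower-priority ResNet model might never get a chance to run. Triton's rate limiter treats priority as a weighting rather than a strict ordering, so ResNet keeps a share of the scheduling chances. What priority alone does not do is cap how many instances of a model execute at the same time.
This is where the resources setting of the rate_limiter configuration becomes a crucial tool for ensuring fairness.
The rate limiter works by tracking named resources, each with a fixed number of available copies (think of them as "execution slots"). A model instance can only begin executing a batch if it can acquire the resources it declares, and it releases them when execution finishes.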
Let's refine the ResNet50 configuration to prevent it from overwhelming the system, even if it had the chance.
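# models/resnet50_v1_onnx/config.pbtxt (With Rate Limiting)
# ... all previous settings; rate_limiter lives inside instance_group ...
instance_group [
  {
    count: 2
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "gpu_execution_accelerator"
          count: 1
        }
      ]
      priority: 2
    }
  }
]
In this setup, we're defining a resource called gpu_execution_accelerator (the name is arbitrary). The count of 1 is the number of copies each resnet50 instance must acquire before it may execute. Unless overridden with the --rate-limit-resource flag, the number of copies available on the device defaults to the largest count requested by any instance, which here is 1. As a result, out of all the resnet50 instances (we configured 2), only one can be executing on the GPU at any given time. The other instance waits until the first one finishes and releases the resource. This acts as a concurrency cap within a specific model, guaranteeing that it can't hog all parallel execution resources on the GPU, leaving them free for other models like BERT. Any other model that declares the same resource name competes for the same copy, and none of these settings take effect unless the server is launched with --rate-limit=execution_count.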
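The "execution slot" idea is the same as a counting semaphore. The sketch below is an analogy in plain Python threads, not Triton code: `ResourcePool` and `peak_concurrency` are hypothetical names, and `time.sleep` stands in for a model execution. With one copy of the resource and two instances, at most one execution is ever in flight.

```python
import threading
import time


class ResourcePool:
    """A named resource with a fixed number of copies, held during execution."""

    def __init__(self, copies: int):
        self._sem = threading.Semaphore(copies)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def run(self, work_seconds: float) -> None:
        with self._sem:  # block until a copy of the resource is free
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            time.sleep(work_seconds)  # stand-in for the model execution
            with self._lock:
                self._running -= 1


def peak_concurrency(copies: int, instances: int, batches_per_instance: int) -> int:
    """Run several instances against one pool and report peak parallel executions."""
    pool = ResourcePool(copies)

    def instance() -> None:
        for _ in range(batches_per_instance):
            pool.run(0.005)

    threads = [threading.Thread(target=instance) for _ in range(instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool.peak


print(peak_concurrency(copies=1, instances=2, batches_per_instance=5))  # 1
```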
Production-Grade Implementation: A Complete Example
Let's tie this all together into a reproducible Docker-based setup.
Directory Structure:
triton-production-repo/
├── models/
│   ├── bert_sentiment/
│   │   ├── 1/
│   │   │   └── model.onnx  # Placeholder for your actual model file
│   │   └── config.pbtxt
│   └── resnet50_v1_onnx/
│       ├── 1/
│       │   └── model.onnx  # Placeholder for your actual model file
│       └── config.pbtxt
└── Dockerfile
Dockerfile:
# Use the official Triton container from NVIDIA's GPU Cloud
FROM nvcr.io/nvidia/tritonserver:23.10-py3
# Create and copy the model repository
WORKDIR /app
COPY ./models /models
# Expose the necessary ports
EXPOSE 8000 
EXPOSE 8001
EXPOSE 8002
# The command to run the server with our advanced scheduling options
CMD ["tritonserver", \
     "--model-repository=/models", \
     "--model-control-mode=explicit", \
     "--load-model=bert_sentiment", \
     "--load-model=resnet50_v1_onnx", \
     "--rate-limit=execution_count", \
     "--log-verbose=1"]
models/bert_sentiment/config.pbtxt:
name: "bert_sentiment"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    # Sequence length only; max_batch_size > 0 makes the batch dimension implicit
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
    rate_limiter { priority: 1 }  # highest priority
  }
]
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 4000
}
# CUDA graph optimization is currently only recognized by the TensorRT
# backend, so it is left disabled for this ONNX Runtime model.
# optimization { cuda { graphs: true } }
models/resnet50_v1_onnx/config.pbtxt:
name: "resnet50_v1_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "dense_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
    rate_limiter {
      resources [ { name: "gpu_execution_accelerator", count: 1 } ]
      priority: 2  # lower priority than BERT
    }
  }
]
dynamic_batching {
  preferred_batch_size: [16, 32, 64]
  max_queue_delay_microseconds: 15000
}
# Note: CUDA graph optimization is currently only recognized by the TensorRT
# backend, so it is left disabled for this ONNX Runtime model.
# optimization { cuda { graphs: true } }
Monitoring and Tuning the Live System
Configuration is static; production traffic is not. The final piece of the puzzle is monitoring the system to validate your assumptions and tune parameters iteratively.
Triton exposes a Prometheus metrics endpoint on port 8002. Scraping this endpoint provides the ground truth for your system's performance.
Key Metrics to Dashboard and Alert On:
*   nv_gpu_utilization{gpu_uuid="..."}: Your primary indicator of success. It is reported as a fraction between 0.0 and 1.0. For a multi-model server, this should be consistently high (above 0.8, i.e. 80%) under load.
*   nv_inference_queue_duration_us{model="..."}: The most important metric for tuning. It is a cumulative counter of the time requests have spent waiting in the scheduling queue, so divide its increase over an interval by the increase in request count to get the average wait per request. If that average for your BERT model is creeping up and approaching your max_queue_delay_microseconds, you are on the verge of violating your latency SLA.
*   nv_inference_request_success{model="..."}: A cumulative counter of successful requests. A falling rate of increase, especially alongside a rising nv_inference_request_failure, indicates server-side errors.
*   nv_inference_compute_infer_duration_us{model="..."}: The cumulative time spent inside the actual model forward pass. Comparing this to nv_inference_queue_duration_us is critical. If queue time is much larger than compute time, your requests are spending more time waiting to be batched than they are running, which means your max_queue_delay might be too high for the traffic pattern.
*   nv_inference_pending_request_count{model="..."}: A gauge of how many requests are currently queued and awaiting execution. A continuously increasing value indicates your server is overloaded and cannot keep up with the request rate.
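Because the duration metrics are cumulative counters, the useful number is the change between two scrapes divided by the change in request count. The sketch below parses Prometheus text output and computes the average queue wait per request. The sample values and the helper names `parse_metrics` and `avg_queue_us` are illustrative; in a live system the text would come from the metrics endpoint (by default `http://localhost:8002/metrics`).

```python
import re

# Two hypothetical scrapes taken some seconds apart.
SCRAPE_BEFORE = """\
nv_inference_request_success{model="bert_sentiment",version="1"} 1000
nv_inference_queue_duration_us{model="bert_sentiment",version="1"} 2000000
"""
SCRAPE_AFTER = """\
nv_inference_request_success{model="bert_sentiment",version="1"} 2000
nv_inference_queue_duration_us{model="bert_sentiment",version="1"} 5000000
"""

SAMPLE_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')
MODEL_LABEL = re.compile(r'model="([^"]*)"')


def parse_metrics(text: str) -> dict:
    """Map (metric_name, model) to the sample value; comment lines are skipped."""
    samples = {}
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue
        metric, labels, value = match.groups()
        model = MODEL_LABEL.search(labels)
        if model:
            samples[(metric, model.group(1))] = float(value)
    return samples


def avg_queue_us(before: dict, after: dict, model: str) -> float:
    """Average queue wait per request (microseconds) between two scrapes."""
    queue = "nv_inference_queue_duration_us"
    count = "nv_inference_request_success"
    d_queue = after[(queue, model)] - before[(queue, model)]
    d_count = after[(count, model)] - before[(count, model)]
    return d_queue / d_count if d_count else 0.0


before = parse_metrics(SCRAPE_BEFORE)
after = parse_metrics(SCRAPE_AFTER)
print(avg_queue_us(before, after, "bert_sentiment"))  # 3000.0
```

With the BERT configuration's 4000 microsecond queue delay, an average wait of 3000 microseconds would suggest that many batches are dispatched only when the delay is about to expire.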
The Iterative Tuning Loop:
*   Check nv_inference_queue_duration_us for the latency-sensitive model. Is it too high? Lower its max_queue_delay_microseconds.
*   If nv_gpu_utilization is low, your server is under-loaded or your batching is not aggressive enough. Consider increasing max_queue_delay_microseconds for the throughput-oriented model (ResNet).
*   Adjust max_queue_delay or a preferred_batch_size, redeploy, and repeat the load test. This feedback loop is essential for homing in on the optimal configuration for your specific hardware and traffic patterns.
Conclusion: From Configuration to Architecture
Effectively serving multiple models from a single GPU is not a simple matter of co-location. It is an architectural discipline that requires a deep understanding of the underlying hardware and the serving software's scheduling capabilities. We've moved beyond basic configurations to demonstrate a production-ready pattern using Triton's advanced features:
*   instance_group for intra-model parallelism.
*   dynamic_batching with carefully chosen preferred_batch_size and max_queue_delay to manage the latency/throughput trade-off.
*   Rate limiter priorities and resources to enforce business logic and SLAs at the scheduler level.
By combining these tools and establishing a robust monitoring and tuning loop, engineering teams can transform their GPU fleet from a collection of underutilized, high-cost assets into a highly efficient, multi-tenant compute fabric, maximizing both performance and ROI.