Triton Multi-Model GPU Sharing via Dynamic Batching & Priorities

Goh Ling Yong

The Underutilization Dilemma in GPU Inference

In modern MLOps, deploying a single model per GPU is a luxury that few production environments can afford. A state-of-the-art NVIDIA A100 or H100 GPU possesses computational power and memory bandwidth that often far exceeds the requirements of a single, moderately-sized model like a ResNet-50 or even a DistilBERT under typical load. This leads to a critical problem: significant underutilization of expensive hardware, with GPU utilization metrics frequently languishing below 50%. The business implication is a direct inflation of Total Cost of Ownership (TCO) for ML inference services.

The strategic objective is clear: co-locate multiple models on a single GPU to maximize hardware utilization and ROI. However, this introduces complex resource contention challenges. How do you prevent a high-throughput, batch-oriented NLP model from starving a latency-sensitive, real-time computer vision model? How do you manage GPU memory and schedule execution to meet disparate Service Level Objectives (SLOs) for each model?

This is where a nuanced understanding of the NVIDIA Triton Inference Server's advanced scheduling and execution control features becomes non-negotiable. We will bypass introductory concepts and dive directly into the orchestration of dynamic batching, instance groups, and scheduling priorities—the three pillars of efficient multi-model GPU sharing. This guide assumes you are already familiar with Triton's basic architecture and the config.pbtxt file.

Triton's Concurrency Primitives: A Refresher for Experts

Before we combine them, let's briefly recalibrate our understanding of the key mechanisms. While you know what they are, their interaction is where the complexity lies.

  • instance_group: This stanza in config.pbtxt defines how many execution instances of a model are loaded. Each instance processes one inference request (which may itself be a batch) at a time. A common misconception is that more instances always mean more throughput. In reality, each instance consumes GPU memory and maintains its own execution context, and on a single GPU you get overlapping execution rather than guaranteed parallelism: kernels from different instances generally only run simultaneously when spare SMs are available. Setting count too high adds scheduling overhead and memory pressure for little gain.
  • dynamic_batching: This is Triton's scheduler-level mechanism for grouping individual inference requests into a single batch before sending them to a model instance. This is the primary driver for maximizing computational efficiency, as GPUs are designed for parallel operations on large tensors. The key is that this batch formation happens before the request is handed to a model instance.
The critical insight is that these two features work in concert: the dynamic batcher forms a batch, and the scheduler then hands that batch to one of the available model instances defined by instance_group. Our goal is to tune both to orchestrate a multi-model environment; a minimal combined sketch follows.
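To make the interaction concrete, here is a minimal sketch of the two stanzas together for a hypothetical model; the name and the numbers are illustrative starting points, not recommendations.

protobuf
# Illustrative config.pbtxt fragment combining both primitives
name: "example_model"
max_batch_size: 32

instance_group [
  {
    count: 2        # two concurrent execution contexts on this GPU
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16]          # opportunistic batch targets
  max_queue_delay_microseconds: 4000     # wait at most 4ms to form a preferred batch
}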

    Deep Dive: Advanced Dynamic Batching Configuration

    The dynamic_batching block is more than just setting max_batch_size. Its real power lies in the fine-tuning parameters that dictate the trade-off between latency and throughput.

    Let's analyze a sophisticated configuration:

    protobuf
    # config.pbtxt for a ResNet-50 model
    name: "resnet50_vision"
    platform: "tensorrt_plan"
    max_batch_size: 64
    
    dynamic_batching {
      preferred_batch_size: [8, 16, 32]
      max_queue_delay_microseconds: 5000
      default_queue_policy {
        timeout_action: REJECT
        default_timeout_microseconds: 10000000
        allow_timeout_override: false
      }
    }

    `preferred_batch_size`

    This is arguably the most important and underutilized parameter. Instead of waiting until max_batch_size is reached or the delay expires, the scheduler will attempt to form a batch of a preferred size as soon as possible. Providing an array [8, 16, 32] gives the scheduler multiple opportunistic targets.

    * How it works: If 9 requests are in the queue, the scheduler can immediately form a batch of 8 and dispatch it, rather than waiting for more requests or for the max_queue_delay_microseconds timer to expire. This dramatically reduces latency under moderate load.

    * Production Pattern: Analyze your model's performance with different batch sizes using triton-model-analyzer. Identify the "sweet spots" where throughput increases significantly (e.g., powers of 2 are common for TensorRT engines). Use these as your preferred sizes. For a ResNet50, performance might jump at sizes 8, 16, and 32, making them ideal candidates.
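If you want a quick manual sweep before committing to a full Model Analyzer run, perf_analyzer can measure each batch size directly; the loop below is a simple sketch using the model name from this article.

bash
# Measure latency/throughput at fixed batch sizes to locate the sweet spots
for bs in 1 2 4 8 16 32 64; do
  perf_analyzer -m resnet50_vision -b $bs --concurrency-range 1 -p 5000
done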

    `max_queue_delay_microseconds`

    This parameter directly controls your latency/throughput trade-off. It's the maximum time the scheduler will wait to form a preferred batch before dispatching whatever is in the queue (up to max_batch_size).

    * Low Value (e.g., 1000µs): Prioritizes latency. The scheduler waits very little time to accumulate requests, resulting in smaller batches but faster response times for individual requests. Ideal for real-time, user-facing applications.

    * High Value (e.g., 20000µs): Prioritizes throughput. The scheduler waits longer, allowing more requests to accumulate into larger, more efficient batches. This increases overall throughput at the cost of higher per-request latency. Ideal for offline or batch processing tasks.

    Setting this requires a deep understanding of your SLOs. A value of 5000 (5ms) is a reasonable starting point for services that need to be responsive but can benefit from batching.
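To make the trade-off concrete, here are two contrasting sketches of the same block, one tuned for latency and one for throughput; the values are illustrative starting points.

protobuf
# Alternative A: latency-oriented (real-time, user-facing)
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 1000    # dispatch almost immediately
}

# Alternative B: throughput-oriented (offline / batch processing)
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 20000   # wait longer so large batches can form
}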

    The Multi-Model Contention Problem

    Now, let's introduce a second model onto our GPU: a BERT-Large model for NLP tasks. BERT is significantly more computationally and memory-intensive than ResNet-50.

    If we load both with default configurations, we'll immediately see problems:

  • Memory Contention: Both models' weights and intermediate activation tensors must fit in VRAM. If their combined footprint exceeds available memory, one will fail to load.
  • Execution Contention: A single large batch inference from BERT can occupy the GPU's Streaming Multiprocessors (SMs) for hundreds of milliseconds. During this time, any incoming, latency-sensitive requests for the ResNet model will be queued, waiting for the GPU to become free, violating their SLOs.
This is where simple dynamic batching fails. We need more granular control.

    Solution 1: Capping Resource Usage with `instance_group`

    We can use instance_group to limit the concurrent executions of our resource-heavy BERT model, effectively putting a cap on its ability to monopolize the GPU.

    Consider this configuration for the BERT model:

    protobuf
    # config.pbtxt for a BERT-Large model
    name: "bert_nlp"
    platform: "onnxruntime_onnx"
    max_batch_size: 32
    
    instance_group [
      {
        count: 2
        kind: KIND_GPU
      }
    ]
    
    dynamic_batching {
      preferred_batch_size: [16, 32]
      max_queue_delay_microseconds: 20000
    }

    By setting count: 2, we are telling Triton to load only two instances of the BERT model. This means that at any given moment, a maximum of two batches of BERT inferences can be running on the GPU concurrently. Even if 100 requests arrive for BERT simultaneously, and the dynamic batcher forms three batches of 32, only two will be dispatched to the model instances. The third will wait until one of the first two completes. This acts as a backpressure mechanism, preventing BERT from overwhelming the GPU and leaving execution slots open for other models like our ResNet.
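You can watch this backpressure build up through Triton's Prometheus metrics endpoint (port 8002 by default): the cumulative queue time for bert_nlp grows while its compute time stays roughly proportional to the number of executed batches.

bash
# Per-model cumulative queue vs. compute time, in microseconds
curl -s localhost:8002/metrics \
  | grep -E 'nv_inference_(queue|compute_infer)_duration_us' \
  | grep bert_nlp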

    Solution 2: Orchestrating Execution with Model Priorities

    Limiting concurrency is a good defensive measure, but it doesn't solve the problem of which model's request to process first. If requests for both ResNet and BERT are waiting in the scheduler's queue, which one gets picked?

This is controlled by Triton's rate limiter. Each model instance can declare a scheduling priority, and the resources it needs, in the rate_limiter block of its instance_group; the rate limiter then uses these settings to arbitrate execution across every model loaded on the server.

First, you must enable the rate limiter at the server level with command-line arguments to Triton:

    bash
--rate-limit=execution_count --rate-limit-resource=gpu_slots:2:0

This enables Triton's rate limiter, which arbitrates execution across all loaded models instead of dispatching a batch the moment an instance is free. Two details matter. First, an instance's priority is a weighting: an instance with priority 2 receives half the scheduling chances of an instance with priority 1, so lower numbers mean stronger preference. Second, prioritization only kicks in when instances compete for rate-limiter resources, so we define a user-named resource (gpu_slots; the name is arbitrary) with two copies on GPU 0. Each instance below requests one copy per execution, so at most two batches run at once and, when a copy frees up, the rate limiter decides by priority which waiting model gets it.

Next, we add a rate_limiter block to each model's instance_group in config.pbtxt.

    ResNet-50 (High Priority):

    protobuf
# config.pbtxt for resnet50_vision
name: "resnet50_vision"
...
dynamic_batching { ... }

instance_group [
  {
    count: 2
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "gpu_slots" count: 1 } ]   # user-defined shared resource
      priority: 1                                    # lower value = stronger preference
    }
  }
]

    BERT-Large (Low Priority):

    protobuf
# config.pbtxt for bert_nlp
name: "bert_nlp"
...
dynamic_batching { ... }

instance_group [
  {
    count: 2
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "gpu_slots" count: 1 } ]
      priority: 3   # one-third the scheduling chances of a priority-1 instance
    }
  }
]

With this configuration, whenever both models have batches waiting for a free gpu_slots copy, the rate limiter gives the ResNet instances three times the scheduling chances of the BERT instances. Note that this is a weighting rather than a strict priority queue: BERT still makes progress under sustained ResNet load, but the latency-sensitive model gets the GPU first most of the time. This is a powerful tool for protecting the SLOs of your most critical models.

Edge Case: Priority Starvation: Even with a weighting, a sustained stream of high-priority requests can leave low-priority requests queued for a very long time. To mitigate this, set a default_timeout_microseconds in the default_queue_policy of the low-priority model. If a request for the BERT model waits longer than this timeout, Triton rejects it (with timeout_action: REJECT) instead of letting it wait indefinitely. This prevents unbounded queue buildup and gives the client a clear signal that the system is overloaded.
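A sketch of that mitigation for the low-priority model; the 30-second timeout is an arbitrary example value.

protobuf
# bert_nlp config.pbtxt: reject requests that have waited too long in the queue
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 20000
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 30000000   # 30s: fail fast instead of queueing forever
  }
}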

    Production Scenario: A Complete Hybrid Workload Configuration

    Let's put everything together in a realistic scenario. We are tasked with deploying two models on a single NVIDIA A10G GPU:

  • fraud_detection_model (based on ResNet-34): Must have a p99 latency below 75ms. Receives a steady stream of individual requests.
  • document_summarizer_model (based on T5-base): A heavy transformer model. Throughput is the primary goal; latency is secondary. Used for asynchronous batch processing.
Step 1: Analyze and Profile

    First, use triton-model-analyzer to profile each model in isolation to understand its memory footprint and performance characteristics at different batch sizes.

    * Fraud Model: We find it uses ~1.5GB of VRAM and its performance sweet spots are at batch sizes 4, 8, and 16. Inference time is ~20ms at batch size 8.

    * Summarizer Model: We find it's a memory hog at ~6GB of VRAM. Its throughput scales almost linearly up to batch size 32. Inference time is ~400ms at batch size 16.

    The combined VRAM usage is ~7.5GB, which fits comfortably on an A10G (24GB VRAM), leaving room for activations and CUDA contexts.
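A minimal profiling pass might look like the following; the Model Analyzer flags can vary between versions, so treat this as a sketch rather than a copy-paste recipe.

bash
# Profile both models in isolation to capture latency, throughput and VRAM per batch size
model-analyzer profile \
  --model-repository=/models \
  --profile-models=fraud_detection_model,document_summarizer_model

# Track VRAM usage while the profiling runs
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1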

    Step 2: Configure Triton Server

Launch Triton with the rate limiter enabled:

    bash
tritonserver --model-repository=/models --rate-limit=execution_count --rate-limit-resource=gpu_slots:4:0

We reuse the gpu_slots resource from before, sized at four copies on GPU 0 to match the fraud model's instance count, and differentiate the two models purely by rate-limiter priority.

    Step 3: Configure the High-Priority `fraud_detection_model`

    We prioritize low latency and high availability.

    protobuf
    # /models/fraud_detection_model/config.pbtxt
    name: "fraud_detection_model"
    platform: "tensorrt_plan"
    max_batch_size: 16
    
    input [ ... ]
    output [ ... ]
    
instance_group [
  {
    count: 4
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "gpu_slots" count: 1 } ]
      priority: 1   # strongest scheduling preference
    }
  }
]
    
    dynamic_batching {
      preferred_batch_size: [4, 8]
      max_queue_delay_microseconds: 2000 # 2ms delay to catch co-arriving requests
}

    * instance_group count of 4: Allows us to handle up to 4 concurrent batches, ensuring we can service bursts of traffic.

    * max_queue_delay_microseconds: 2000: A very short delay. We are sacrificing some batching potential for minimal latency.

* rate_limiter priority: 1: Biases the rate limiter toward the fraud model whenever both models are waiting for a free gpu_slots copy.

    Step 4: Configure the Low-Priority `document_summarizer_model`

    We prioritize high throughput.

    protobuf
    # /models/document_summarizer_model/config.pbtxt
    name: "document_summarizer_model"
    platform: "onnxruntime_onnx"
    max_batch_size: 32
    
    input [ ... ]
    output [ ... ]
    
instance_group [
  {
    count: 1 # Limit concurrent runs to prevent GPU monopolization
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "gpu_slots" count: 1 } ]
      priority: 2   # half the scheduling chances of the fraud model
    }
  }
]
    
    dynamic_batching {
      preferred_batch_size: [8, 16, 32]
      max_queue_delay_microseconds: 50000 # 50ms, wait longer to form large batches
}

    * instance_group count of 1: This is the crucial limiter. We only allow one batch of this heavy model to execute at a time, leaving GPU resources free for the fraud model.

    * max_queue_delay_microseconds: 50000: A long delay to maximize the chance of forming large, efficient batches.

* rate_limiter priority: 2: Gives the summarizer half the scheduling chances of the fraud model when both are waiting for a gpu_slots copy, so it tends to run in the gaps between fraud-detection bursts.

    Benchmarking the Optimized Configuration

    Validating this setup is non-negotiable. We use Triton's perf_analyzer to simulate a realistic, mixed workload. We will run two perf_analyzer instances concurrently, one for each model.

    Terminal 1: Simulate high-concurrency traffic to the fraud model

    bash
perf_analyzer -m fraud_detection_model -i grpc -u localhost:8001 --percentile=99 --concurrency-range 4:32:4 -p 5000

This sends requests to the fraud model over gRPC, sweeping concurrency from 4 to 32 in steps of 4 and reporting p99 latency over 5-second measurement windows.

    Terminal 2: Simulate a steady batch workload for the summarizer model

    bash
perf_analyzer -m document_summarizer_model -i grpc -u localhost:8001 --concurrency-range 16 -p 5000

    While these are running, monitor GPU usage with watch -n 0.5 nvidia-smi.

    Expected Results & Analysis

| Configuration | fraud_model p99 Latency | summarizer_model Throughput | GPU Utilization | Observations |
| --- | --- | --- | --- | --- |
| Naive (defaults) | 280 ms | 45 inf/sec | 70% | High latency on the fraud model; the summarizer's long inference time blocks it. GPU utilization is spiky. |
| Optimized (tuned) | 68 ms | 82 inf/sec | 95% | Fraud model SLO is met, summarizer throughput nearly doubles, and GPU utilization is high and stable. |

    The results from the optimized configuration demonstrate success. The p99 latency for the critical fraud model is well within our 75ms SLO, even under heavy concurrent load from the summarizer. Simultaneously, the summarizer's throughput has increased significantly because its longer queue delay allows it to form larger, more efficient batches during the small gaps between fraud detection requests. Most importantly, GPU utilization is consistently high, proving we are maximizing the value of our hardware.

    Final Gotchas and Advanced Considerations

* Backend-Specific Batching: The examples above are generic, but the underlying model backend matters immensely. A TensorRT engine (.plan file) must be built with an optimization profile that covers the range of batch sizes you specify in dynamic_batching (a sample trtexec build command is sketched after this list). If you configure max_batch_size: 32 but your TensorRT engine was only optimized for batch sizes 1 and 8, Triton will refuse to load the model or fail at inference time. This is a common and frustrating deployment issue.

    * CPU Bottlenecks: In models with significant pre/post-processing logic (e.g., image resizing, text tokenization), the CPU can become the bottleneck. Even with a perfectly tuned GPU schedule, if the CPU can't prepare the data fast enough, the GPU will sit idle. Always profile the entire end-to-end latency. Consider using Triton's Python or C++ backends to move pre/post-processing logic onto the server to reduce network overhead and potentially execute it on the CPU in parallel with GPU inference.

    * CUDA Contexts: Every model instance creates a CUDA context, which consumes VRAM. A configuration with many models and many instances per model can fail due to VRAM exhaustion from contexts alone, even before model weights are loaded. Be judicious with the count parameter in instance_group.
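For the TensorRT gotcha above, the batch range is fixed when the engine is built. A sketch with trtexec, assuming an ONNX export whose input tensor is named input; adjust names and shapes to your model.

bash
# Build an engine whose optimization profile covers batch sizes 1 through 32
trtexec --onnx=resnet50.onnx \
        --minShapes=input:1x3x224x224 \
        --optShapes=input:16x3x224x224 \
        --maxShapes=input:32x3x224x224 \
        --saveEngine=model.plan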

    By moving beyond default settings and strategically combining dynamic batching, instance groups, and priority levels, you can transform a collection of disparate models into a highly efficient, cooperative, multi-tenant inference system. This level of tuning is no longer optional; it is a core competency for building scalable and cost-effective AI services in production.
