Scaling Triton on K8s: Dynamic Batching & Concurrent Model Patterns

Goh Ling Yong

The Inefficiency of Naive GPU Serving

In a production Kubernetes environment, the default approach to serving a machine learning model is often a simple one-to-one mapping: one model, one pod, one GPU. While straightforward, this pattern is profoundly inefficient and costly. A single inference request, even for a complex model like BERT, might only utilize a high-end GPU like an A100 for a few milliseconds. The rest of the time, that expensive hardware sits idle, depreciating in value while consuming power. This underutilization is the silent killer of MLOps budgets.

The core challenge is that inference workloads are often latency-sensitive and arrive in stochastic bursts, making it difficult to saturate GPU resources without introducing unacceptable delays. NVIDIA's Triton Inference Server provides a sophisticated toolkit to solve this problem, but leveraging it effectively requires moving beyond the documentation's basic examples. This article dissects two of Triton's most powerful features—Dynamic Batching and Concurrent Model Execution—and demonstrates how to combine them into production-ready patterns on Kubernetes.

We will not cover the basics of setting up Triton or the NVIDIA GPU Operator. We assume you have a functioning Kubernetes cluster with GPU nodes and are looking to solve second-order performance and efficiency problems.

Deep Dive: Mastering Dynamic Batching for Throughput

Dynamic batching is Triton's server-side mechanism for transparently grouping individual inference requests into a larger batch before execution. This amortizes the overhead of kernel launches and data transfer, allowing the GPU to operate closer to its peak computational capacity. The trade-off is a small increase in latency for individual requests as they wait in a queue to form a batch.
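
Dynamic batching is entirely transparent to callers: each client still sends a single request, and the scheduler forms batches behind the scenes. The sketch below illustrates this from the client side using the tritonclient HTTP API against the resnet50_fp16 configuration shown below; the server URL is an assumption for your environment.

python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

TRITON_URL = "triton-server:8000"  # assumed HTTP endpoint; adjust for your cluster


def single_request(_):
    # Each caller sends an ordinary batch-of-1 request; any batching is
    # performed transparently by Triton's dynamic batcher on the server.
    client = httpclient.InferenceServerClient(url=TRITON_URL)
    inp = httpclient.InferInput("input_1", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    result = client.infer(model_name="resnet50_fp16", inputs=[inp])
    return result.as_numpy("predictions")


# 64 concurrent callers, mirroring the perf_analyzer concurrency used later.
with ThreadPoolExecutor(max_workers=64) as pool:
    outputs = list(pool.map(single_request, range(64)))
print(len(outputs), outputs[0].shape)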

Configuration and The Latency/Throughput Trade-off

The key to effective dynamic batching lies in the dynamic_batching stanza of a model's config.pbtxt.

Let's consider a ResNet50 model optimized with TensorRT. A naive configuration might look like this:

models/resnet50_fp16/config.pbtxt

pbtxt
name: "resnet50_fp16"
platform: "tensorrt_plan"
max_batch_size: 256
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

This configuration only sets max_batch_size. Without the dynamic_batching block, Triton never combines separate requests into a batch: each request executes with whatever batch size the client sent, so under typical single-request traffic the GPU mostly sees batch size 1. To enable server-side queueing and batching, we add the dynamic_batching block.

models/resnet50_fp16/config.pbtxt (with Dynamic Batching)

pbtxt
name: "resnet50_fp16"
platform: "tensorrt_plan"
max_batch_size: 256
# ... input/output definitions ...
dynamic_batching {
  preferred_batch_size: [8, 16, 32, 64]
  max_queue_delay_microseconds: 5000
}

This is where the advanced tuning begins:

  • max_queue_delay_microseconds: This is the most critical parameter. It defines the maximum time an individual request will wait in the queue before the scheduler forms a batch, even if a preferred_batch_size hasn't been reached. A lower value (e.g., 1000 µs) prioritizes low latency at the expense of smaller, less efficient batches. A higher value (e.g., 20000 µs) maximizes throughput by allowing more time for larger batches to form, but increases p99 latency. Production Pattern: Do not guess this value. Benchmark your model under a realistic load profile. Start with a value around your p99 latency SLO (e.g., 5-10ms for a real-time API) and adjust based on performance testing.
  • preferred_batch_size: This gives hints to the scheduler. When the queue delay expires, Triton will try to form a batch of one of these sizes. If the number of requests in the queue is 40 and the delay expires, it will likely form a batch of 32 and leave 8 in the queue. This helps align batch sizes with GPU memory layouts and kernel optimizations (e.g., Tensor Cores are most efficient with batch sizes that are multiples of 8). Production Pattern: Analyze your model's performance at different batch sizes using perf_analyzer to determine the most efficient sizes for your specific GPU architecture.

    Benchmarking the Impact

    Theory is insufficient; empirical data is required. Triton's perf_analyzer tool is essential for this.

    Let's compare performance with and without dynamic batching. We'll simulate 64 concurrent clients sending requests.

    Command (without dynamic batching):

    bash
    perf_analyzer -m resnet50_fp16 -u triton-server:8001 --concurrency-range 64

    Result (example):

    text
    *** Measurement Summary ***
      Concurrency: 64, throughput: 450 infer/sec, latency: 142000 usec

    Command (with dynamic batching, max_queue_delay_microseconds: 5000):

    bash
    perf_analyzer -m resnet50_fp16 -u triton-server:8001 --concurrency-range 64

    Result (example):

    text
    *** Measurement Summary ***
      Concurrency: 64, throughput: 1850 infer/sec, latency: 34500 usec

    The results are stark: a 4x increase in throughput and a 75% reduction in average latency. The latency reduction is counter-intuitive but critical to understand: without batching, 64 requests form a massive queue, and each is processed serially. With batching, requests are grouped, processed much more efficiently, and the overall time-to-last-byte for the entire set of requests is significantly lower.
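
    Rather than hand-editing the delay and rerunning the comparison, you can automate the sweep. The sketch below is an assumption about your workflow rather than a Triton feature: it rewrites max_queue_delay_microseconds in the config shown above and reruns the same perf_analyzer command for each candidate value, relying on --model-control-mode=poll (configured in the Kubernetes manifest later in this article) to pick up each change.

    python
    import re
    import subprocess
    import time
    from pathlib import Path

    CONFIG = Path("models/resnet50_fp16/config.pbtxt")
    DELAYS_US = [1000, 2000, 5000, 10000, 20000]

    for delay in DELAYS_US:
        # Rewrite the queue delay in place (sketch only; a GitOps flow would
        # commit this change to the model repository instead).
        text = CONFIG.read_text()
        text = re.sub(r"max_queue_delay_microseconds:\s*\d+",
                      f"max_queue_delay_microseconds: {delay}", text)
        CONFIG.write_text(text)

        # Give Triton's 30-second repository poll time to reload the model.
        time.sleep(35)

        # Same invocation as above; capture and compare the measurement summaries.
        subprocess.run(
            ["perf_analyzer", "-m", "resnet50_fp16",
             "-u", "triton-server:8001", "--concurrency-range", "64"],
            check=True,
        )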

    Edge Case: Handling Priority and Timeout

    What happens when the request queue is full or when some requests are more important than others? This is where queue policies come in.

    pbtxt
    dynamic_batching {
      preferred_batch_size: [16, 32]
      max_queue_delay_microseconds: 4000
      default_queue_policy {
        timeout_action: REJECT
        default_timeout_microseconds: 10000000 # 10 seconds
        allow_timeout_override: true
        max_queue_size: 512
      }
      priority_levels: 3
      default_priority_level: 2
    }

    * default_queue_policy: This block provides backpressure. max_queue_size prevents the server from being overwhelmed by requests, failing fast instead of accumulating infinite latency. timeout_action: REJECT will cause Triton to return an error for requests that wait too long, which is preferable to a client-side timeout.

    * priority_levels: This enables a multi-level priority queue. A request from a premium user could be sent with a priority of 1, while a background batch job could be sent with a priority of 3. Triton's scheduler will always service higher-priority batches first. This is a powerful feature for multi-tenant systems or services with mixed-criticality workloads.
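
    On the client side, priorities and timeout overrides are attached per request. The sketch below uses the tritonclient gRPC API; the priority and timeout parameters reflect our understanding of that API and the server URL is an assumption, so verify both against the client version you ship.

    python
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="triton-server:8001")  # assumed gRPC endpoint

    inp = grpcclient.InferInput("input_1", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

    # Premium traffic: priority 1 (highest of the 3 configured levels) and a
    # 2-second timeout instead of the 10-second default_timeout_microseconds,
    # permitted because allow_timeout_override is true. The priority/timeout
    # keyword arguments are assumed from the tritonclient gRPC API.
    result = client.infer(
        model_name="resnet50_fp16",
        inputs=[inp],
        priority=1,
        timeout=2_000_000,  # microseconds
    )
    print(result.as_numpy("predictions").shape)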

    Concurrent Model Execution: Saturating a Single GPU

    Dynamic batching optimizes a single model. But what if you need to serve a dozen different models? Deploying a dozen pods with a dozen GPUs is financially ruinous. The goal is to co-locate multiple models on a single GPU.

    Triton achieves this via model instances. By default, Triton loads one instance of each model. You can increase this to run multiple copies of the same model in parallel or to run different models concurrently.

    The `instance_group` Configuration

    Let's configure a memory-intensive BERT model for sentiment analysis and our compute-intensive ResNet50 model to run on the same GPU.

    models/bert_large/config.pbtxt

    pbtxt
    name: "bert_large"
    platform: "onnxruntime"
    max_batch_size: 32
    # ... input/output ...
    
    instance_group [
      {
        count: 2
        kind: KIND_GPU
        gpus: [0]
      }
    ]
    
    dynamic_batching {
      max_queue_delay_microseconds: 10000
    }

    models/resnet50_fp16/config.pbtxt

    pbtxt
    name: "resnet50_fp16"
    # ... other settings ...
    
    instance_group [
      {
        count: 3
        kind: KIND_GPU
        gpus: [0]
      }
    ]
    
    dynamic_batching {
      max_queue_delay_microseconds: 5000
    }

    Here's what this configuration accomplishes:

    * Triton will load 2 instances of bert_large and 3 instances of resnet50_fp16 into the memory of GPU 0.

    * When requests arrive for both models, Triton's scheduler can execute them in parallel using different CUDA streams. If a batch for ResNet50 is running, an incoming request for BERT doesn't have to wait; it can be scheduled on one of the idle BERT instances.

    * This dramatically improves GPU utilization. While one model instance might be waiting for data transfer (memory-bound), another can be executing a kernel (compute-bound).

    Production Pattern: The number of instances (count) is a function of your GPU's VRAM and the model's memory footprint. You can find the memory usage in Triton's logs on startup. A common strategy is to load instances until you reach ~90% VRAM utilization, leaving a buffer for execution overhead.
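
    A back-of-the-envelope sizing helper makes this concrete. The sketch below uses pynvml to read total VRAM and estimates how many instances fit under a 90% budget; the per-instance footprints are illustrative placeholders you would replace with the numbers from Triton's startup logs.

    python
    import pynvml

    # Illustrative per-instance footprints (MiB); take the real values from
    # Triton's startup logs for your own models.
    FOOTPRINTS_MIB = {"bert_large": 2600, "resnet50_fp16": 900}
    TARGET_UTILIZATION = 0.90  # leave ~10% VRAM headroom for execution overhead

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, as in the configs above
    total_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**2
    pynvml.nvmlShutdown()

    budget_mib = total_mib * TARGET_UTILIZATION
    per_model_budget = budget_mib / len(FOOTPRINTS_MIB)  # naive even split
    for model, footprint in FOOTPRINTS_MIB.items():
        print(f"{model}: ~{int(per_model_budget // footprint)} instances fit")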

    Advanced Pattern: CPU/GPU Affinity

    Some models, particularly those for pre/post-processing, are CPU-bound. Running them on the GPU is wasteful. The instance_group setting can pin these to the CPU.

    Consider a text pre-processing model that tokenizes input for BERT.

    models/bert_tokenizer/config.pbtxt

    pbtxt
    name: "bert_tokenizer"
    backend: "python"
    max_batch_size: 128
    # ... input/output ...
    
    instance_group [
      {
        count: 8
        kind: KIND_CPU
      }
    ]

    * kind: KIND_CPU: This instructs Triton to run these instances on the CPU, not touching the GPU.

    * count: 8: You can scale the number of CPU instances to match the number of available vCPUs on your Kubernetes node to maximize parallelism.

    This hybrid CPU/GPU pattern is essential for building efficient, multi-stage inference pipelines.
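
    For context, the bert_tokenizer entry above is backed by a models/bert_tokenizer/1/model.py implementing Triton's Python backend interface. The skeleton below is a minimal sketch: the tensor names match the ensemble configuration later in this article, but the tokenization itself is a placeholder you would replace with a real BERT tokenizer.

    python
    # models/bert_tokenizer/1/model.py -- minimal Python backend skeleton (sketch)
    import numpy as np
    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def initialize(self, args):
            # A real implementation would load a BERT vocabulary/tokenizer here.
            self.max_len = 128

        def execute(self, requests):
            responses = []
            for request in requests:
                text = pb_utils.get_input_tensor_by_name(request, "TEXT_INPUT").as_numpy()
                batch = text.shape[0]
                # Placeholder output: real code would emit proper token IDs and masks.
                input_ids = np.zeros((batch, self.max_len), dtype=np.int64)
                attention_mask = np.zeros((batch, self.max_len), dtype=np.int64)
                responses.append(pb_utils.InferenceResponse(output_tensors=[
                    pb_utils.Tensor("INPUT_IDS", input_ids),
                    pb_utils.Tensor("ATTENTION_MASK", attention_mask),
                ]))
            return responses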

    Production Deployment on Kubernetes

    Now, let's tie this together in a production-ready Kubernetes Deployment.

    The Kubernetes Manifest

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: triton-inference-server
      labels:
        app: triton
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: triton
      template:
        metadata:
          labels:
            app: triton
        spec:
          containers:
            - name: triton
              image: nvcr.io/nvidia/tritonserver:23.10-py3
              args:
                - "tritonserver"
                - "--model-repository=/models"
                - "--model-control-mode=poll"
                - "--repository-poll-secs=30"
                - "--log-verbose=1"
              ports:
                - containerPort: 8000
                  name: http
                - containerPort: 8001
                  name: grpc
                - containerPort: 8002
                  name: metrics
              resources:
                limits:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
                  cpu: "8"
                requests:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
                  cpu: "8"
              volumeMounts:
                - name: models
                  mountPath: /models
              livenessProbe:
                httpGet:
                  path: /v2/health/live
                  port: http
                initialDelaySeconds: 30
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /v2/health/ready
                  port: http
                initialDelaySeconds: 60
                periodSeconds: 5
    
            - name: model-syncer
              image: k8s.gcr.io/git-sync/git-sync:v3.6.4
              args:
                - "--repo=https://your-git-repo/models.git"
                - "--branch=main"
                - "--dest=/git"
                - "--wait=30"
              volumeMounts:
                - name: models
                  mountPath: /git
    
          volumes:
            - name: models
              emptyDir: {}

    Key Production Features of this Manifest:

  • Resource Limits (nvidia.com/gpu: 1): This is non-negotiable. It ensures the Kubernetes scheduler places the pod on a GPU-enabled node and dedicates one full GPU to it.
  • Probes: livenessProbe and readinessProbe are critical. They allow Kubernetes to know if the Triton server is alive and ready to accept traffic. The initialDelaySeconds should be long enough to allow Triton to load all models from the repository.
  • Model Repository Management: We are using the Git-Sync Sidecar pattern. A git-sync container runs alongside Triton, periodically pulling the latest models from a Git repository into a shared emptyDir volume. Triton is configured with --model-control-mode=poll to automatically detect and load/unload model changes without a pod restart. This enables a robust GitOps workflow for MLOps. A quick client-side verification sketch follows this list.
  • Alternative Pattern: For very large models that don't fit well in Git, replace the git-sync sidecar with an initContainer that downloads models from an S3 bucket or GCS on startup.
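
    The behaviour the probes and poll mode rely on can be confirmed from any pod with network access to the server. The sketch below uses the tritonclient HTTP API; the service hostname is an assumption for your cluster.

    python
    import tritonclient.http as httpclient

    # Assumed in-cluster service name; adjust or port-forward as needed.
    client = httpclient.InferenceServerClient(url="triton-inference-server:8000")

    # Mirrors what the liveness and readiness probes check.
    print("live:", client.is_server_live())
    print("ready:", client.is_server_ready())

    # After git-sync delivers a change and the 30s poll fires, confirm each
    # model actually (re)loaded.
    for model in ("resnet50_fp16", "bert_large", "bert_tokenizer"):
        print(model, "ready:", client.is_model_ready(model))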

    Tying It All Together: A Complex Pipeline with Ensembling

    Let's architect a solution for a real-world problem: a sentiment analysis API that takes raw text, tokenizes it, and runs it through a BERT model.

    Doing this with two separate service calls (Client -> Tokenizer -> Client -> BERT) introduces network latency and complexity. Triton's Ensemble Scheduler can chain these models together on the server-side into a single, atomic operation.

    We need three model directories:

  • bert_tokenizer: The CPU-bound Python model.
  • bert_large: The GPU-bound ONNX model.
  • sentiment_pipeline: A special ensemble model that defines the execution graph.

    models/sentiment_pipeline/config.pbtxt

    pbtxt
    name: "sentiment_pipeline"
    platform: "ensemble"
    max_batch_size: 128
    
    input [
      {
        name: "RAW_TEXT"
        data_type: TYPE_STRING
        dims: [ 1 ]
      }
    ]
    
    output [
      {
        name: "PREDICTIONS"
        data_type: TYPE_FP32
        dims: [ 2 ]
      }
    ]
    
    ensemble_scheduling {
      step: [
        {
          model_name: "bert_tokenizer"
          model_version: -1
          input_map {
            key: "TEXT_INPUT"
            value: "RAW_TEXT"
          }
          output_map [
            {
              key: "INPUT_IDS"
              value: "tokenized_ids"
            },
            {
              key: "ATTENTION_MASK"
              value: "tokenized_mask"
            }
          ]
        },
        {
          model_name: "bert_large"
          model_version: -1
          input_map [
            {
              key: "input_ids"
              value: "tokenized_ids"
            },
            {
              key: "attention_mask"
              value: "tokenized_mask"
            }
          ]
          output_map {
            key: "output"
            value: "PREDICTIONS"
          }
        }
      ]
    }

    This configuration defines a two-step Directed Acyclic Graph (DAG):

  • The ensemble's input RAW_TEXT is mapped to the TEXT_INPUT of the bert_tokenizer model.
  • The outputs of bert_tokenizer (INPUT_IDS and ATTENTION_MASK) are given temporary internal tensor names (tokenized_ids, tokenized_mask).
  • These internal tensors are then mapped to the inputs of the bert_large model.
  • Finally, the output of bert_large is mapped to the ensemble's final output, PREDICTIONS.

    The client now makes a single call to the sentiment_pipeline endpoint, and Triton orchestrates the entire multi-model, CPU-to-GPU workflow internally, eliminating network overhead and simplifying the client-side logic.
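
    As a concrete illustration, a minimal Python client for the ensemble might look like the sketch below (tritonclient HTTP API; the URL is an assumption, and the two-element output corresponds to the dims: [ 2 ] defined above).

    python
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="triton-server:8000")  # assumed endpoint

    # Shape [1, 1]: a batch of one string, matching the ensemble's dims: [ 1 ] input.
    text = np.array([["This product exceeded every expectation."]], dtype=object)

    inp = httpclient.InferInput("RAW_TEXT", [1, 1], "BYTES")
    inp.set_data_from_numpy(text)
    out = httpclient.InferRequestedOutput("PREDICTIONS")

    result = client.infer(model_name="sentiment_pipeline", inputs=[inp], outputs=[out])
    print(result.as_numpy("PREDICTIONS"))  # shape [1, 2] sentiment scores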

    Advanced Monitoring with Prometheus

    Optimizing these systems is impossible without visibility. Triton exposes a rich set of metrics on its /metrics port (8002).

    First, configure Prometheus to scrape this endpoint using a ServiceMonitor CRD.

    yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: triton-metrics
      labels:
        release: prometheus
    spec:
      selector:
        matchLabels:
          app: triton
      endpoints:
      - port: metrics
        interval: 15s

    Key Metrics and PromQL Queries:

    * GPU Utilization: The most basic health check.

    avg by (gpu_uuid) (nv_gpu_utilization{namespace="default", pod="triton-inference-server-xxxx"})

    * Queue Latency: This is your primary metric for tuning max_queue_delay_microseconds. Triton's default per-model latency metrics are cumulative counters, so divide by the request rate to get the average queue time per request in milliseconds (per-request percentiles require enabling Triton's optional summary metrics). If this value exceeds your SLO, your delay is too high or you lack sufficient model instances.

    rate(nv_inference_queue_duration_us{model="resnet50_fp16"}[5m]) / rate(nv_inference_request_success{model="resnet50_fp16"}[5m]) / 1000

    * Request Throughput vs. Compute Latency: This helps you understand if you're bottlenecked by request queuing or by the model itself.

    * Throughput: sum(rate(nv_inference_request_success{model="resnet50_fp16"}[1m]))

    * Average Compute Latency (ms): rate(nv_inference_compute_infer_duration_us{model="resnet50_fp16"}[5m]) / rate(nv_inference_request_success{model="resnet50_fp16"}[5m]) / 1000

    By building Grafana dashboards with these queries, you can create a feedback loop: deploy a config.pbtxt change, observe the impact on latency and throughput under load, and iterate. This data-driven approach is the hallmark of a mature MLOps practice.
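
    To close that feedback loop programmatically, you can pull the same numbers from Prometheus's HTTP API after each load test. The sketch below assumes a Prometheus server reachable at PROMETHEUS_URL inside the cluster.

    python
    import requests

    PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # assumed in-cluster address

    QUERIES = {
        "avg queue latency (ms)":
            'rate(nv_inference_queue_duration_us{model="resnet50_fp16"}[5m])'
            ' / rate(nv_inference_request_success{model="resnet50_fp16"}[5m]) / 1000',
        "throughput (infer/sec)":
            'sum(rate(nv_inference_request_success{model="resnet50_fp16"}[1m]))',
    }

    for name, query in QUERIES.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        value = float(result[0]["value"][1]) if result else float("nan")
        print(f"{name}: {value:.2f}")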

    Conclusion

    Effectively serving ML models in production on Kubernetes is a systems design problem, not just a machine learning problem. By moving beyond naive deployment patterns and mastering Triton's advanced features, you can transform your ML inference platform from a costly, underperforming liability into an efficient, scalable, and robust system. The key takeaways are:

  • Always use Dynamic Batching: It's the single most effective tool for increasing GPU throughput. Tune max_queue_delay_microseconds methodically based on your latency SLOs and benchmark data.
  • Co-locate Models with Concurrent Instances: Maximize GPU utilization by running multiple models or multiple instances of the same model on a single GPU. Use a mix of KIND_GPU and KIND_CPU instances to match your workload.
  • Automate Model Deployment: Use a GitOps pattern with a git-sync sidecar or an object-store-based initContainer to manage your model repository without manual intervention.
  • Chain Operations with Ensembling: For multi-stage pipelines, use ensembles to eliminate network latency and simplify your architecture.
  • Monitor Everything: You cannot optimize what you cannot measure. Leverage Triton's Prometheus metrics to gain deep insight into queue times, execution latency, and GPU health.

    By implementing these production-grade patterns, you can build an inference serving layer that is not only performant but also cost-effective, allowing you to scale your AI/ML services without scaling your cloud bill proportionally.
