Optimizing LLM Inference Latency with TensorRT on AWS SageMaker
The Production Inference Wall: When Native Frameworks Aren't Enough
In the lifecycle of a production machine learning system, particularly one involving Large Language Models (LLMs), the transition from training to serving marks a fundamental shift in priorities. While training is a compute-intensive, offline process focused on model accuracy, inference is a latency-sensitive, online operation where every millisecond counts. For applications like real-time chatbots, semantic search, or content generation APIs, high latency directly translates to poor user experience and non-viable products.
A common initial deployment strategy involves serving a native PyTorch or TensorFlow model on a GPU-enabled instance using a framework like TorchServe or TensorFlow Serving. While functional, this approach often hits a performance wall. These frameworks are general-purpose and do not perform the deep, hardware-specific optimizations required to squeeze maximum performance from NVIDIA GPUs. The model graph is executed operator-by-operator, incurring significant kernel launch overhead and failing to leverage opportunities for layer fusion or reduced-precision arithmetic.
This is the precise problem NVIDIA's TensorRT is designed to solve. It's not just a serving framework; it's a high-performance deep learning inference optimizer and runtime. TensorRT takes a trained model and reconstructs it into a highly optimized engine specifically tuned for a target GPU architecture. Its core optimization strategies include layer and tensor fusion (collapsing sequences of operations into single kernels), reduced-precision execution (FP16 and INT8), kernel auto-tuning for the target GPU, and dynamic tensor memory management.
This article is a practical guide for senior engineers on implementing these optimizations. We will bypass introductory concepts and focus on the advanced, production-critical steps: building a custom INT8-quantized TensorRT engine for a Transformer model and deploying it within a custom container on AWS SageMaker for scalable, low-latency inference.
The Optimization Workflow: From Hugging Face to TensorRT Engine
Our goal is to take a pre-trained model from the Hugging Face ecosystem—we'll use the fine-tuned sentiment classifier `distilbert-base-uncased-finetuned-sst-2-english` as a manageable example, though the principles apply directly to larger models like Llama or T5—and convert it into a deployable TensorRT engine. The path involves a critical intermediate step: ONNX (Open Neural Network Exchange).
Step 1: Exporting to ONNX with Dynamic Axes
ONNX serves as the bridge between the training framework (PyTorch) and the inference optimizer (TensorRT). Exporting a model to ONNX requires careful attention to dynamic input shapes, a non-negotiable requirement for most NLP tasks where sequence length varies.
Here's a Python script to perform the export. Note the `dynamic_axes` argument—this is crucial. It informs the ONNX graph that the `batch_size` and `sequence_length` dimensions are not fixed, allowing TensorRT to later build an engine that can handle variable-sized inputs.
# export_to_onnx.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import os


def export_model_to_onnx(
    model_name: str,
    output_dir: str,
    output_filename: str = "model.onnx",
):
    """Exports a Hugging Face model to ONNX with dynamic axes."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # Set model to evaluation mode

    # Define a sample input with realistic dimensions.
    # We use a batch size of 1 and a sequence length of 128 for the export trace;
    # these dimensions are made dynamic below.
    sample_text = "This is a sample sentence for ONNX export."
    inputs = tokenizer(sample_text, return_tensors="pt", padding="max_length", max_length=128, truncation=True)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    output_path = os.path.join(output_dir, output_filename)
    print(f"Exporting model to {output_path}...")

    torch.onnx.export(
        model,
        (input_ids, attention_mask),                  # Model inputs
        output_path,
        input_names=["input_ids", "attention_mask"],  # Input names
        output_names=["logits"],                      # Output names
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence_length"},
            "attention_mask": {0: "batch_size", 1: "sequence_length"},
            "logits": {0: "batch_size"},
        },
        opset_version=13,  # A commonly supported opset version
        do_constant_folding=True,
    )

    print("ONNX export complete.")
    print(f"Model saved to {output_path}")


if __name__ == "__main__":
    MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
    OUTPUT_DIR = "./models/distilbert_onnx"
    export_model_to_onnx(MODEL_NAME, OUTPUT_DIR)
Production Considerations:
* `opset_version`: This is critical. Newer versions support more operators but may not be fully compatible with your TensorRT version. `opset_version=13` or `14` is often a safe bet; you may need to experiment.
* Unsupported Operators: For more complex models, you might encounter PyTorch operators that don't have a direct ONNX equivalent. This often requires model surgery—rewriting parts of your model using supported operations—or implementing a custom TensorRT plugin, which is a significantly more advanced topic.
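Before handing the file to TensorRT, it's worth validating the exported graph and confirming numerical parity with the PyTorch model. Here is a minimal sanity-check sketch; it assumes the `onnx` and `onnxruntime` packages are installed, which are not otherwise used in this article.

# validate_onnx.py -- quick sanity check of the exported graph (illustrative)
import numpy as np
import onnx
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
ONNX_PATH = "./models/distilbert_onnx/model.onnx"

# Structural validation of the ONNX graph
onnx.checker.check_model(onnx.load(ONNX_PATH))

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

# Use a sequence length different from the export trace to exercise the dynamic axes
enc = tokenizer("Dynamic shapes should just work.", return_tensors="pt", padding="max_length", max_length=64, truncation=True)

with torch.no_grad():
    torch_logits = model(**enc).logits.numpy()

sess = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
ort_logits = sess.run(
    ["logits"],
    {"input_ids": enc["input_ids"].numpy(), "attention_mask": enc["attention_mask"].numpy()},
)[0]

# FP32 outputs should agree closely
print("max abs diff:", np.max(np.abs(torch_logits - ort_logits)))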
Advanced Optimization: INT8 Quantization with a Custom Calibrator
With our ONNX model in hand, we can now build the TensorRT engine. While an FP16 engine offers a good balance of performance and simplicity, INT8 quantization can provide a 2-3x throughput improvement, which is often a game-changer. However, this comes at the cost of potential accuracy degradation. To mitigate this, we use a calibration process.
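For reference, building an FP16 engine requires no calibration at all—only a builder flag. The following is a minimal sketch; the `build_int8_engine` function shown later in this section follows the same overall structure, with INT8 and a calibrator swapped in.

# Minimal FP16 builder configuration (no calibrator needed) -- illustrative sketch
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("./models/distilbert_onnx/model.onnx", "rb") as f:
    assert parser.parse(f.read()), "ONNX parsing failed"

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Allow FP16 kernels; TensorRT keeps FP32 where it is faster or required

# Dynamic-shape engines still require an optimization profile (same ranges as the INT8 build below)
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(16, 512))
profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(16, 512))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)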
Post-Training Quantization (PTQ) involves running the model on a small, representative sample of data to observe the activation distributions for each layer. TensorRT uses this information to calculate optimal scaling factors for quantizing floating-point weights and activations to 8-bit integers.
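To make the idea concrete, symmetric INT8 quantization maps a floating-point range onto integers with a per-tensor scale; the snippet below illustrates the basic arithmetic. TensorRT's entropy calibrator is more sophisticated: rather than simply taking the observed maximum, it searches for a clipping threshold that minimizes the KL divergence between the original and quantized activation distributions.

# Illustration of symmetric INT8 quantization with a per-tensor scale
import numpy as np

activations = np.random.randn(1024).astype(np.float32) * 3.0  # stand-in for observed activations

threshold = np.abs(activations).max()  # naive choice; entropy calibration picks a better threshold
scale = threshold / 127.0              # map [-threshold, threshold] onto [-127, 127]

quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print("scale:", scale)
print("max quantization error:", np.max(np.abs(activations - dequantized)))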
Implementing a custom calibrator gives us full control over this process.
The `IInt8EntropyCalibrator2` Implementation
We'll create a class that inherits from TensorRT's `IInt8EntropyCalibrator2`. This class is responsible for providing batches of calibration data to TensorRT.
# build_engine.py
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Important for initializing the CUDA context
import numpy as np
import os


# A custom calibrator class for INT8 quantization
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calibration_data, cache_file, batch_size=8):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.data = calibration_data  # A list of dicts like {'input_ids': ..., 'attention_mask': ...}
        self.batch_size = batch_size
        self.current_index = 0

        # Allocate GPU memory for one batch of each input
        self.device_inputs = []
        for inp in self.data[0].values():
            size = trt.volume(inp.shape) * inp.dtype.itemsize * self.batch_size
            self.device_inputs.append(cuda.mem_alloc(size))

        # Create a generator for calibration batches
        self.batches = self.create_batches()

    def create_batches(self):
        for i in range(0, len(self.data), self.batch_size):
            batch_data = self.data[i:i + self.batch_size]
            if len(batch_data) < self.batch_size:
                # Pad the last batch if necessary
                padding_needed = self.batch_size - len(batch_data)
                batch_data.extend(self.data[:padding_needed])
            # Collate into contiguous (batch_size, seq_len) arrays
            input_ids_batch = np.ascontiguousarray(np.array([item['input_ids'] for item in batch_data]))
            attention_mask_batch = np.ascontiguousarray(np.array([item['attention_mask'] for item in batch_data]))
            yield [input_ids_batch, attention_mask_batch]

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
            # Copy data to GPU
            for i, data in enumerate(batch):
                cuda.memcpy_htod(self.device_inputs[i], data)
            # TensorRT expects a list of device pointers (as ints)
            return [int(d) for d in self.device_inputs]
        except StopIteration:
            return None

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
# --- Main Engine Building Logic ---
def build_int8_engine(onnx_path, engine_path, calibration_data):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model
    with open(onnx_path, 'rb') as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError("Failed to parse the ONNX file.")

    config = builder.create_builder_config()
    # 1 GB of builder workspace; on newer TensorRT versions prefer
    # config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    config.max_workspace_size = 1 << 30  # 1 GB

    # --- This is the key part for INT8 and Dynamic Shapes ---
    config.set_flag(trt.BuilderFlag.INT8)
    calibrator = EntropyCalibrator(calibration_data, "./calibration.cache", batch_size=8)
    config.int8_calibrator = calibrator

    # --- Define Optimization Profile for Dynamic Shapes ---
    profile = builder.create_optimization_profile()
    # Input 0: input_ids, Input 1: attention_mask
    profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(16, 512))
    profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(16, 512))
    config.add_optimization_profile(profile)

    print("Building TensorRT engine... (This may take a while)")
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Failed to build the TensorRT engine.")

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
    print(f"Engine saved to {engine_path}")
if __name__ == "__main__":
    # You must create a representative calibration dataset.
    # For this example, we generate dummy data; in production, use real data samples.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
    calibration_texts = ["this is a great movie", "I really disliked the plot", "the acting was superb"] * 50

    calibration_data = []
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=128, truncation=True)
        calibration_data.append({
            # Drop the leading batch dimension so each sample is a 1-D (seq_len,) array;
            # the calibrator stacks samples into (batch_size, seq_len) batches.
            'input_ids': inputs['input_ids'][0].astype(np.int32),
            'attention_mask': inputs['attention_mask'][0].astype(np.int32),
        })

    ONNX_PATH = "./models/distilbert_onnx/model.onnx"
    ENGINE_PATH = "./models/distilbert_trt/model.engine"
    os.makedirs(os.path.dirname(ENGINE_PATH), exist_ok=True)

    build_int8_engine(ONNX_PATH, ENGINE_PATH, calibration_data)
Key Implementation Details:
* Calibration Data Quality: The most critical factor for INT8 accuracy is the quality of your calibration dataset. It *must* be representative of the data the model will see in production. If your production data is diverse, your calibration set must also be diverse. A few hundred samples are usually sufficient.
* Optimization Profiles: The `set_shape` call is how we handle dynamic inputs. We define a `min`, `opt` (optimal), and `max` shape for each dynamic input. TensorRT will tune kernels specifically for the `opt` shape, but the resulting engine will be capable of handling any input dimension within the `min` and `max` bounds. Defining this range too broadly can lead to slightly less optimal performance, so it's a trade-off between flexibility and speed. (A quick way to inspect the profile baked into a built engine is sketched just after this list.)
* Calibration Cache: The `read_calibration_cache`/`write_calibration_cache` methods are a crucial optimization. The calibration process can be slow. By caching the results, subsequent engine builds (e.g., in a CI/CD pipeline) can skip the calibration step if the network and calibration data haven't changed.
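Before wiring the engine into a serving stack, it's useful to deserialize it and confirm that the bindings and optimization profile look as expected. A small inspection sketch, assuming the engine was built with the script above:

# inspect_engine.py -- print bindings and profile ranges of a built engine (illustrative)
import pycuda.autoinit  # Initializes the CUDA context needed for deserialization
import tensorrt as trt

ENGINE_PATH = "./models/distilbert_trt/model.engine"

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for binding in engine:
    is_input = engine.binding_is_input(binding)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    shape = engine.get_binding_shape(binding)  # -1 marks dynamic dimensions
    print(f"{'input ' if is_input else 'output'} {binding}: shape={tuple(shape)}, dtype={dtype}")
    if is_input:
        # (min, opt, max) shapes of optimization profile 0
        min_shape, opt_shape, max_shape = engine.get_profile_shape(0, binding)
        print(f"    profile 0: min={min_shape}, opt={opt_shape}, max={max_shape}")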
Production Deployment: A Custom SageMaker Container
AWS SageMaker is a powerful platform for deploying ML models, but its default containers are designed for standard frameworks. To run a custom TensorRT engine, we must build our own Docker container that adheres to the SageMaker Runtime API.
This involves:
- A Dockerfile that installs all necessary dependencies (CUDA, cuDNN, TensorRT, Python libraries).
- A web server (e.g., Flask, FastAPI) to handle inference requests.
- Scripts to manage the container's lifecycle.
Step 1: The Inference Server
We'll use Flask to create a simple server. SageMaker expects the container to listen on port 8080 and respond to POST requests on `/invocations` for inference and GET requests on `/ping` for health checks.
Here's the structure of our deployment directory:
/sagemaker_deployment
├── model.engine # Our compiled TensorRT engine
├── Dockerfile
└── code/
├── app.py # The Flask server
├── requirements.txt # Python dependencies
└── wsgi.py # Gunicorn entrypoint
code/app.py:
# code/app.py
import os
import json
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Important for initializing the CUDA context
from flask import Flask, request, Response

app = Flask(__name__)


# --- TensorRT Inference Class ---
class TRTInference:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        self.inputs, self.outputs, self.bindings, self.stream = [], [], [], cuda.Stream()
        # Maximum batch size allowed by optimization profile 0; used to size output buffers
        self.max_batch = self.engine.get_profile_shape(0, "input_ids")[2][0]
        for binding in self.engine:
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            if self.engine.binding_is_input(binding):
                # Dynamic inputs report -1 dims, so allocate for the profile's max shape
                shape = self.engine.get_profile_shape(0, binding)[2]
            else:
                # Logits are (-1, num_classes); bound the dynamic batch dim by the profile max
                shape = [self.max_batch if d == -1 else d for d in self.engine.get_binding_shape(binding)]
            host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, input_ids, attention_mask):
        batch_size = input_ids.shape[0]
        # Set context shapes for the dynamic inputs (must fall within the optimization profile)
        self.context.set_binding_shape(0, input_ids.shape)
        self.context.set_binding_shape(1, attention_mask.shape)

        # Copy data into the pagelocked staging buffers (sized for the max shape, so slice)
        np.copyto(self.inputs[0]['host'][:input_ids.size], input_ids.ravel())
        np.copyto(self.inputs[1]['host'][:attention_mask.size], attention_mask.ravel())

        # Transfer input data to GPU
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)

        # Run inference
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)

        # Transfer predictions back from GPU
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)

        # Synchronize stream
        self.stream.synchronize()

        # The output buffer is sized for the max shape; slice out the live batch and reshape.
        # For sequence classification the output is (batch_size, num_classes).
        num_classes = 2  # Example for SST-2
        output_data = self.outputs[0]['host'][:batch_size * num_classes]
        return output_data.reshape((batch_size, num_classes))


# --- Load model and tokenizer ---
# In a real app, tokenizer files would be packaged in the container
from transformers import AutoTokenizer

MODEL_PATH = '/opt/ml/model/model.engine'
TOKENIZER_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

trt_model = TRTInference(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)


@app.route('/ping', methods=['GET'])
def ping():
    return Response(response='\n', status=200, mimetype='application/json')


@app.route('/invocations', methods=['POST'])
def invocations():
    data = request.get_json(force=True)
    inputs_text = data.get('inputs', [])
    if not inputs_text:
        return Response(response=json.dumps({'error': 'Missing "inputs" key'}), status=400, mimetype='application/json')

    # Preprocess. pad_to_multiple_of=16 keeps the sequence length within the engine's
    # optimization profile (min 16, max 512). Batches larger than the profile's max
    # batch size (16 here) should be split upstream or the profile widened.
    tokens = tokenizer(inputs_text, return_tensors="np", padding=True, pad_to_multiple_of=16, truncation=True, max_length=512)
    input_ids = tokens['input_ids'].astype(np.int32)
    attention_mask = tokens['attention_mask'].astype(np.int32)

    # Inference
    logits = trt_model.infer(input_ids, attention_mask)

    # Postprocess (softmax and predictions); subtract the row max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    predictions = np.argmax(probabilities, axis=1).tolist()
    scores = np.max(probabilities, axis=1).tolist()

    result = [{'label': pred, 'score': score} for pred, score in zip(predictions, scores)]
    return Response(response=json.dumps(result), status=200, mimetype='application/json')
Step 2: The Dockerfile
This Dockerfile starts from the official NVIDIA TensorRT container, which ships with compatible CUDA, cuDNN, and TensorRT libraries, so we only add the serving dependencies on top.
Dockerfile:
# Use the official TensorRT container as the base
# Choose a version compatible with your build environment and GPU drivers
FROM nvcr.io/nvidia/tensorrt:23.10-py3
# Set up environment
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY code/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
# Copy the inference code and model
# SageMaker expects the model artifacts in /opt/ml/model
COPY model.engine /opt/ml/model/model.engine
COPY code /opt/program
# Set up the entrypoint for the container
WORKDIR /opt/program
# SageMaker starts the container as "docker run <image> serve"; the shell-form
# ENTRYPOINT ignores that trailing argument, so gunicorn starts cleanly.
ENTRYPOINT gunicorn --bind 0.0.0.0:8080 wsgi:app
code/requirements.txt:
flask
gunicorn
transformers
numpy
pycuda
# torch is needed by transformers for some utilities; the extra index below lets pip
# resolve the CPU-only build to keep the image small
--extra-index-url https://download.pytorch.org/whl/cpu
torch
code/wsgi.py:
from app import app
if __name__ == "__main__":
    app.run()
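Once you have built the image locally (the build and push commands appear in Step 3), it's worth smoke-testing the container before deploying—for example, started with something like `docker run --gpus all -p 8080:8080 sagemaker-tensorrt-llm-inference:latest`. A small client sketch using the `requests` library (not part of the container's dependencies):

# smoke_test_container.py -- hit the locally running container (illustrative)
import requests

BASE_URL = "http://localhost:8080"

# Health check endpoint used by SageMaker
assert requests.get(f"{BASE_URL}/ping", timeout=5).status_code == 200

# Same JSON contract that SageMaker will forward to /invocations
payload = {"inputs": ["TensorRT is incredibly fast.", "This movie was not good."]}
resp = requests.post(f"{BASE_URL}/invocations", json=payload, timeout=30)
print(resp.status_code, resp.json())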
Step 3: Build, Push, and Deploy
Finally, we use the SageMaker Python SDK to manage the deployment.
# deploy_sagemaker.py
import tarfile

import sagemaker
import boto3
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# 1. Setup session and roles
sess = sagemaker.Session()
boto_session = boto3.Session()
region = boto_session.region_name
role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity().get('Account')

# 2. Define ECR repository
image_name = 'sagemaker-tensorrt-llm-inference'
tag = 'latest'
repository_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}'

# --- Run these commands in your terminal first! ---
# aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
# docker build -t {image_name}:{tag} .
# docker tag {image_name}:{tag} {repository_uri}
# docker push {repository_uri}
print(f"Make sure you have built and pushed the image to: {repository_uri}")

# 3. Package and upload the model artifact to S3
# SageMaker expects model_data to be a gzipped tarball; its contents are
# extracted into /opt/ml/model inside the container at startup.
with tarfile.open('model.tar.gz', 'w:gz') as archive:
    archive.add('model.engine')
model_artifact = sess.upload_data(path='model.tar.gz', key_prefix='tensorrt-llm/model')
print(f"Model artifact uploaded to: {model_artifact}")

# 4. Create SageMaker Model
trt_model = Model(
    image_uri=repository_uri,
    model_data=model_artifact,
    role=role,
    sagemaker_session=sess,
)

# 5. Deploy to an endpoint
# The engine must run on the same GPU architecture it was built for;
# ml.g5 instances use the Ampere-class A10G.
instance_type = 'ml.g5.xlarge'
endpoint_name = 'tensorrt-distilbert-endpoint'

predictor = trt_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(f"Endpoint {endpoint_name} deployed successfully.")

# 6. Test the endpoint
try:
    payload = {'inputs': ['TensorRT is incredibly fast.', 'This movie was not good.']}
    response = predictor.predict(payload)
    print("\nInference Response:")
    print(response)
finally:
    # 7. Clean up when you are done (commented out to keep the endpoint running)
    # predictor.delete_endpoint()
    pass
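Once the endpoint is live, production clients typically call it through the low-level runtime API rather than the SageMaker Python SDK. A minimal example with boto3, using the endpoint name from the script above:

# invoke_endpoint.py -- call the deployed endpoint with boto3 (illustrative)
import json

import boto3

runtime = boto3.client('sagemaker-runtime')

payload = {'inputs': ['The latency on this endpoint is fantastic.']}
response = runtime.invoke_endpoint(
    EndpointName='tensorrt-distilbert-endpoint',
    ContentType='application/json',
    Body=json.dumps(payload),
)
print(json.loads(response['Body'].read()))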
Performance Benchmarks and Edge Case Handling
Theory is one thing, but production performance is what matters. A proper benchmark compares the end-to-end latency and throughput of the different deployment strategies.
Sample Benchmark Results (on ml.g5.xlarge with distilbert-base-uncased-finetuned-sst-2-english):

| Deployment Strategy | Average Latency (ms) | P95 Latency (ms) | Throughput (req/sec) |
|---|---|---|---|
| PyTorch (native on SageMaker) | 45 | 62 | ~22 |
| TensorRT FP16 Engine | 18 | 25 | ~55 |
| TensorRT INT8 Engine | 8 | 12 | ~125 |
As the results show, the TensorRT INT8 engine cuts average latency by roughly 5-6x and delivers more than 5x the throughput of the baseline PyTorch deployment. The P95 latency is particularly important, as it represents the worst-case experience most users will notice.
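Numbers like these are only meaningful if you measure them against your own endpoint and traffic pattern. The sketch below is a crude sequential harness using the boto3 client and endpoint name from earlier; it measures end-to-end client latency including network overhead, so use a proper load-testing tool for throughput figures.

# benchmark_endpoint.py -- crude sequential latency benchmark (illustrative)
import json
import time

import boto3
import numpy as np

runtime = boto3.client('sagemaker-runtime')
ENDPOINT_NAME = 'tensorrt-distilbert-endpoint'
payload = json.dumps({'inputs': ['The latency on this endpoint is fantastic.']})

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=payload,
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Drop the first few calls to exclude connection setup effects
warm = np.array(latencies_ms[10:])
print(f"avg: {warm.mean():.1f} ms, p50: {np.percentile(warm, 50):.1f} ms, "
      f"p95: {np.percentile(warm, 95):.1f} ms, p99: {np.percentile(warm, 99):.1f} ms")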
Handling Advanced Edge Cases
* Accuracy Degradation with INT8: If you observe a significant drop in accuracy after INT8 quantization, your first step is to analyze your calibration dataset. Is it truly representative? If improving the data doesn't help, the next step is Quantization-Aware Training (QAT). QAT simulates the effects of quantization during the fine-tuning process, allowing the model to adapt its weights to minimize the quantization error. This is more complex but often recovers most of the lost accuracy.
* Engine Portability and CI/CD: A TensorRT `.engine` file is *not* portable. It is compiled specifically for a TensorRT version and a GPU architecture (e.g., Turing, Ampere, Hopper). In a production MLOps pipeline, your CI/CD system must build a separate engine for each target deployment environment, and your build artifacts should be tagged with the GPU architecture and TensorRT version (e.g., model-ampere-trt23.10.engine).
* Cold Starts: Even with a highly optimized engine, the first invocations on a newly provisioned instance (at deployment time or when auto scaling adds capacity) are slow due to container startup and engine deserialization. For consistent low latency, keep a minimum instance count sized above your baseline traffic so warm capacity is always available (at the cost of paying for idle instances), keep the container image lean so new instances come up quickly, and load the engine once at container startup—as the Flask app above does—rather than lazily on the first request.
* Unsupported Operators: When `parser.parse(model.read())` fails, it's often due to an unsupported operator. The primary solution is to rewrite the problematic section of the PyTorch model using a sequence of simpler, supported operators. If that's not possible, the fallback is to implement a custom TensorRT plugin using the C++ API. This involves writing your own CUDA kernel for the operation, which is a significant undertaking reserved for cases where no other workaround exists.
Conclusion
Optimizing LLM inference latency is a non-trivial but essential task for building production-ready AI applications. Moving from a general-purpose framework like PyTorch to a specialized inference optimizer like TensorRT unlocks performance gains that are simply unattainable otherwise. By mastering the workflow of ONNX conversion, advanced INT8 quantization with custom calibration, and deployment via custom SageMaker containers, engineering teams can deliver the low-latency, high-throughput experiences that modern applications demand. This process requires a deep understanding of the entire stack, from model architecture to GPU hardware, but the resulting performance improvement justifies the investment for any serious MLOps practice.