Optimizing LLM Inference Latency with TensorRT on AWS SageMaker
The Production Inference Wall: When Native Frameworks Aren't Enough
In the lifecycle of a production machine learning system, particularly one involving Large Language Models (LLMs), the transition from training to serving marks a fundamental shift in priorities. While training is a compute-intensive, offline process focused on model accuracy, inference is a latency-sensitive, online operation where every millisecond counts. For applications like real-time chatbots, semantic search, or content generation APIs, high latency directly translates to poor user experience and non-viable products.
A common initial deployment strategy involves serving a native PyTorch or TensorFlow model on a GPU-enabled instance using a framework like TorchServe or TensorFlow Serving. While functional, this approach often hits a performance wall. These frameworks are general-purpose and do not perform the deep, hardware-specific optimizations required to squeeze maximum performance from NVIDIA GPUs. The model graph is executed operator-by-operator, incurring significant kernel launch overhead and failing to leverage opportunities for layer fusion or reduced-precision arithmetic.
This is the precise problem NVIDIA's TensorRT is designed to solve. It's not just a serving framework; it's a high-performance deep learning inference optimizer and runtime. TensorRT takes a trained model and reconstructs it into a highly optimized engine specifically tuned for a target GPU architecture. Its core optimization strategies include layer and tensor fusion (collapsing sequences of operations into single kernels), reduced-precision execution (FP16 and INT8), kernel auto-tuning for the target GPU, and dynamic tensor memory management.
This article is a practical guide for senior engineers on implementing these optimizations. We will bypass introductory concepts and focus on the advanced, production-critical steps: building a custom INT8-quantized TensorRT engine for a Transformer model and deploying it within a custom container on AWS SageMaker for scalable, low-latency inference.
The Optimization Workflow: From Hugging Face to TensorRT Engine
Our goal is to take a pre-trained model from the Hugging Face ecosystem—we'll use the fine-tuned sentiment classifier `distilbert-base-uncased-finetuned-sst-2-english` as a manageable example, though the principles apply directly to larger models like Llama or T5—and convert it into a deployable TensorRT engine. The path involves a critical intermediate step: ONNX (Open Neural Network Exchange).
Step 1: Exporting to ONNX with Dynamic Axes
ONNX serves as the bridge between the training framework (PyTorch) and the inference optimizer (TensorRT). Exporting a model to ONNX requires careful attention to dynamic input shapes, a non-negotiable requirement for most NLP tasks where sequence length varies.
Here's a Python script to perform the export. Note the `dynamic_axes` argument—this is crucial. It informs the ONNX graph that the `batch_size` and `sequence_length` dimensions are not fixed, allowing TensorRT to later build an engine that can handle variable-sized inputs.
# export_to_onnx.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import os


def export_model_to_onnx(
    model_name: str,
    output_dir: str,
    output_filename: str = "model.onnx",
):
    """Exports a Hugging Face model to ONNX with dynamic axes."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # Set model to evaluation mode

    # Define a sample input with realistic dimensions.
    # We use a batch size of 1 and a sequence length of 128 for the export trace;
    # these dimensions are made dynamic below.
    sample_text = "This is a sample sentence for ONNX export."
    inputs = tokenizer(sample_text, return_tensors="pt", padding="max_length", max_length=128, truncation=True)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    output_path = os.path.join(output_dir, output_filename)
    print(f"Exporting model to {output_path}...")

    torch.onnx.export(
        model,
        (input_ids, attention_mask),                  # Model inputs
        output_path,
        input_names=["input_ids", "attention_mask"],  # Input names
        output_names=["logits"],                      # Output names
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence_length"},
            "attention_mask": {0: "batch_size", 1: "sequence_length"},
            "logits": {0: "batch_size"},
        },
        opset_version=13,  # A commonly supported opset version
        do_constant_folding=True,
    )

    print("ONNX export complete.")
    print(f"Model saved to {output_path}")


if __name__ == "__main__":
    MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
    OUTPUT_DIR = "./models/distilbert_onnx"
    export_model_to_onnx(MODEL_NAME, OUTPUT_DIR)
Production Considerations:
* `opset_version`: This is critical. Newer versions support more operators but may not be fully compatible with your TensorRT version. `opset_version=13` or `14` is often a safe bet; you may need to experiment.
* Unsupported Operators: For more complex models, you might encounter PyTorch operators that don't have a direct ONNX equivalent. This often requires model surgery—rewriting parts of your model using supported operations—or implementing a custom TensorRT plugin, which is a significantly more advanced topic.
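Before handing the file to TensorRT, it's worth validating the exported graph and confirming numerical parity with the PyTorch model. Here is a minimal sanity-check sketch; it assumes the `onnx` and `onnxruntime` packages are installed, which are not otherwise used in this article.

# validate_onnx.py -- quick sanity check of the exported graph (illustrative)
import numpy as np
import onnx
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
ONNX_PATH = "./models/distilbert_onnx/model.onnx"

# Structural validation of the ONNX graph
onnx.checker.check_model(onnx.load(ONNX_PATH))

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

# Use a sequence length different from the export trace to exercise the dynamic axes
enc = tokenizer("Dynamic shapes should just work.", return_tensors="pt", padding="max_length", max_length=64, truncation=True)

with torch.no_grad():
    torch_logits = model(**enc).logits.numpy()

sess = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
ort_logits = sess.run(
    ["logits"],
    {"input_ids": enc["input_ids"].numpy(), "attention_mask": enc["attention_mask"].numpy()},
)[0]

# FP32 outputs should agree closely
print("max abs diff:", np.max(np.abs(torch_logits - ort_logits)))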
Advanced Optimization: INT8 Quantization with a Custom Calibrator
With our ONNX model in hand, we can now build the TensorRT engine. While an FP16 engine offers a good balance of performance and simplicity, INT8 quantization can provide a 2-3x throughput improvement, which is often a game-changer. However, this comes at the cost of potential accuracy degradation. To mitigate this, we use a calibration process.
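For reference, building an FP16 engine requires no calibration at all—only a builder flag. The following is a minimal sketch; the `build_int8_engine` function shown later in this section follows the same overall structure, with INT8 and a calibrator swapped in.

# Minimal FP16 builder configuration (no calibrator needed) -- illustrative sketch
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("./models/distilbert_onnx/model.onnx", "rb") as f:
    assert parser.parse(f.read()), "ONNX parsing failed"

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Allow FP16 kernels; TensorRT keeps FP32 where it is faster or required

# Dynamic-shape engines still require an optimization profile (same ranges as the INT8 build below)
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(16, 512))
profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(16, 512))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)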
Post-Training Quantization (PTQ) involves running the model on a small, representative sample of data to observe the activation distributions for each layer. TensorRT uses this information to calculate optimal scaling factors for quantizing floating-point weights and activations to 8-bit integers.
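To make the idea concrete, symmetric INT8 quantization maps a floating-point range onto integers with a per-tensor scale; the snippet below illustrates the basic arithmetic. TensorRT's entropy calibrator is more sophisticated: rather than simply taking the observed maximum, it searches for a clipping threshold that minimizes the KL divergence between the original and quantized activation distributions.

# Illustration of symmetric INT8 quantization with a per-tensor scale
import numpy as np

activations = np.random.randn(1024).astype(np.float32) * 3.0  # stand-in for observed activations

threshold = np.abs(activations).max()  # naive choice; entropy calibration picks a better threshold
scale = threshold / 127.0              # map [-threshold, threshold] onto [-127, 127]

quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print("scale:", scale)
print("max quantization error:", np.max(np.abs(activations - dequantized)))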
Implementing a custom calibrator gives us full control over this process.
The `IInt8EntropyCalibrator2` Implementation
We'll create a class that inherits from TensorRT's `IInt8EntropyCalibrator2`. This class is responsible for providing batches of calibration data to TensorRT.
# build_engine.py
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Important for initializing the CUDA context
import numpy as np
import os


# A custom calibrator class for INT8 quantization
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calibration_data, cache_file, batch_size=8):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.data = calibration_data  # A list of dicts like {'input_ids': ..., 'attention_mask': ...}
        self.batch_size = batch_size
        self.current_index = 0

        # Allocate GPU memory for one batch of each input
        self.device_inputs = []
        for inp in self.data[0].values():
            size = trt.volume(inp.shape) * inp.dtype.itemsize * self.batch_size
            self.device_inputs.append(cuda.mem_alloc(size))

        # Create a generator for calibration batches
        self.batches = self.create_batches()

    def create_batches(self):
        for i in range(0, len(self.data), self.batch_size):
            batch_data = self.data[i:i + self.batch_size]
            if len(batch_data) < self.batch_size:
                # Pad the last batch if necessary
                padding_needed = self.batch_size - len(batch_data)
                batch_data.extend(self.data[:padding_needed])
            # Collate into contiguous (batch_size, seq_len) arrays
            input_ids_batch = np.ascontiguousarray(np.array([item['input_ids'] for item in batch_data]))
            attention_mask_batch = np.ascontiguousarray(np.array([item['attention_mask'] for item in batch_data]))
            yield [input_ids_batch, attention_mask_batch]

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
            # Copy data to GPU
            for i, data in enumerate(batch):
                cuda.memcpy_htod(self.device_inputs[i], data)
            # TensorRT expects a list of device pointers (as ints)
            return [int(d) for d in self.device_inputs]
        except StopIteration:
            return None

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
# --- Main Engine Building Logic ---
def build_int8_engine(onnx_path, engine_path, calibration_data):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model
    with open(onnx_path, 'rb') as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError("Failed to parse the ONNX file.")

    config = builder.create_builder_config()
    # 1 GB of builder workspace; on newer TensorRT versions prefer
    # config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    config.max_workspace_size = 1 << 30  # 1 GB

    # --- This is the key part for INT8 and Dynamic Shapes ---
    config.set_flag(trt.BuilderFlag.INT8)
    calibrator = EntropyCalibrator(calibration_data, "./calibration.cache", batch_size=8)
    config.int8_calibrator = calibrator

    # --- Define Optimization Profile for Dynamic Shapes ---
    profile = builder.create_optimization_profile()
    # Input 0: input_ids, Input 1: attention_mask
    profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(16, 512))
    profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(16, 512))
    config.add_optimization_profile(profile)

    print("Building TensorRT engine... (This may take a while)")
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Failed to build the TensorRT engine.")

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
    print(f"Engine saved to {engine_path}")
if __name__ == "__main__":
    # You must create a representative calibration dataset.
    # For this example, we generate dummy data; in production, use real data samples.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
    calibration_texts = ["this is a great movie", "I really disliked the plot", "the acting was superb"] * 50

    calibration_data = []
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=128, truncation=True)
        calibration_data.append({
            # Drop the leading batch dimension so each sample is a 1-D (seq_len,) array;
            # the calibrator stacks samples into (batch_size, seq_len) batches.
            'input_ids': inputs['input_ids'][0].astype(np.int32),
            'attention_mask': inputs['attention_mask'][0].astype(np.int32),
        })

    ONNX_PATH = "./models/distilbert_onnx/model.onnx"
    ENGINE_PATH = "./models/distilbert_trt/model.engine"
    os.makedirs(os.path.dirname(ENGINE_PATH), exist_ok=True)

    build_int8_engine(ONNX_PATH, ENGINE_PATH, calibration_data)
Key Implementation Details:
* Calibration Data Quality: The most critical factor for INT8 accuracy is the quality of your calibration dataset. It *must* be representative of the data the model will see in production. If your production data is diverse, your calibration set must also be diverse. A few hundred samples are usually sufficient.
* Optimization Profiles: The `set_shape` call is how we handle dynamic inputs. We define a `min`, `opt` (optimal), and `max` shape for each dynamic input. TensorRT will tune kernels specifically for the `opt` shape, but the resulting engine will be capable of handling any input dimension within the `min` and `max` bounds. Defining this range too broadly can lead to slightly less optimal performance, so it's a trade-off between flexibility and speed. (A quick way to inspect the profile baked into a built engine is sketched just after this list.)
* Calibration Cache: The `read_calibration_cache`/`write_calibration_cache` methods are a crucial optimization. The calibration process can be slow. By caching the results, subsequent engine builds (e.g., in a CI/CD pipeline) can skip the calibration step if the network and calibration data haven't changed.
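Before wiring the engine into a serving stack, it's useful to deserialize it and confirm that the bindings and optimization profile look as expected. A small inspection sketch, assuming the engine was built with the script above:

# inspect_engine.py -- print bindings and profile ranges of a built engine (illustrative)
import pycuda.autoinit  # Initializes the CUDA context needed for deserialization
import tensorrt as trt

ENGINE_PATH = "./models/distilbert_trt/model.engine"

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for binding in engine:
    is_input = engine.binding_is_input(binding)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    shape = engine.get_binding_shape(binding)  # -1 marks dynamic dimensions
    print(f"{'input ' if is_input else 'output'} {binding}: shape={tuple(shape)}, dtype={dtype}")
    if is_input:
        # (min, opt, max) shapes of optimization profile 0
        min_shape, opt_shape, max_shape = engine.get_profile_shape(0, binding)
        print(f"    profile 0: min={min_shape}, opt={opt_shape}, max={max_shape}")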
Production Deployment: A Custom SageMaker Container
AWS SageMaker is a powerful platform for deploying ML models, but its default containers are designed for standard frameworks. To run a custom TensorRT engine, we must build our own Docker container that adheres to the SageMaker Runtime API.
This involves:
- A Dockerfile that installs all necessary dependencies (CUDA, cuDNN, TensorRT, Python libraries).
- A web server (e.g., Flask, FastAPI) to handle inference requests.
- Scripts to manage the container's lifecycle.
Step 1: The Inference Server
We'll use Flask to create a simple server. SageMaker expects the container to listen on port 8080 and respond to POST requests on `/invocations` for inference and GET requests on `/ping` for health checks.
Here's the structure of our deployment directory:
/sagemaker_deployment
├── model.engine # Our compiled TensorRT engine
├── Dockerfile
└── code/
├── app.py # The Flask server
├── requirements.txt # Python dependencies
└── wsgi.py # Gunicorn entrypoint
code/app.py:
# code/app.py
import os
import json
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Important for initializing the CUDA context
from flask import Flask, request, Response

app = Flask(__name__)


# --- TensorRT Inference Class ---
class TRTInference:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        self.inputs, self.outputs, self.bindings, self.stream = [], [], [], cuda.Stream()
        # Maximum batch size allowed by optimization profile 0; used to size output buffers
        self.max_batch = self.engine.get_profile_shape(0, "input_ids")[2][0]
        for binding in self.engine:
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            if self.engine.binding_is_input(binding):
                # Dynamic inputs report -1 dims, so allocate for the profile's max shape
                shape = self.engine.get_profile_shape(0, binding)[2]
            else:
                # Logits are (-1, num_classes); bound the dynamic batch dim by the profile max
                shape = [self.max_batch if d == -1 else d for d in self.engine.get_binding_shape(binding)]
            host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, input_ids, attention_mask):
        batch_size = input_ids.shape[0]
        # Set context shapes for the dynamic inputs (must fall within the optimization profile)
        self.context.set_binding_shape(0, input_ids.shape)
        self.context.set_binding_shape(1, attention_mask.shape)

        # Copy data into the pagelocked staging buffers (sized for the max shape, so slice)
        np.copyto(self.inputs[0]['host'][:input_ids.size], input_ids.ravel())
        np.copyto(self.inputs[1]['host'][:attention_mask.size], attention_mask.ravel())

        # Transfer input data to GPU
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)

        # Run inference
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)

        # Transfer predictions back from GPU
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)

        # Synchronize stream
        self.stream.synchronize()

        # The output buffer is sized for the max shape; slice out the live batch and reshape.
        # For sequence classification the output is (batch_size, num_classes).
        num_classes = 2  # Example for SST-2
        output_data = self.outputs[0]['host'][:batch_size * num_classes]
        return output_data.reshape((batch_size, num_classes))


# --- Load model and tokenizer ---
# In a real app, tokenizer files would be packaged in the container
from transformers import AutoTokenizer

MODEL_PATH = '/opt/ml/model/model.engine'
TOKENIZER_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

trt_model = TRTInference(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)


@app.route('/ping', methods=['GET'])
def ping():
    return Response(response='\n', status=200, mimetype='application/json')


@app.route('/invocations', methods=['POST'])
def invocations():
    data = request.get_json(force=True)
    inputs_text = data.get('inputs', [])
    if not inputs_text:
        return Response(response=json.dumps({'error': 'Missing "inputs" key'}), status=400, mimetype='application/json')

    # Preprocess. pad_to_multiple_of=16 keeps the sequence length within the engine's
    # optimization profile (min 16, max 512). Batches larger than the profile's max
    # batch size (16 here) should be split upstream or the profile widened.
    tokens = tokenizer(inputs_text, return_tensors="np", padding=True, pad_to_multiple_of=16, truncation=True, max_length=512)
    input_ids = tokens['input_ids'].astype(np.int32)
    attention_mask = tokens['attention_mask'].astype(np.int32)

    # Inference
    logits = trt_model.infer(input_ids, attention_mask)

    # Postprocess (softmax and predictions); subtract the row max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    predictions = np.argmax(probabilities, axis=1).tolist()
    scores = np.max(probabilities, axis=1).tolist()

    result = [{'label': pred, 'score': score} for pred, score in zip(predictions, scores)]
    return Response(response=json.dumps(result), status=200, mimetype='application/json')
Step 2: The Dockerfile
This Dockerfile starts from the official NVIDIA TensorRT container, which ships with compatible CUDA, cuDNN, and TensorRT libraries, so we only add the serving dependencies on top.
Dockerfile:
# Use the official TensorRT container as the base
# Choose a version compatible with your build environment and GPU drivers
FROM nvcr.io/nvidia/tensorrt:23.10-py3
# Set up environment
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY code/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
# Copy the inference code and model
# SageMaker expects the model artifacts in /opt/ml/model
COPY model.engine /opt/ml/model/model.engine
COPY code /opt/program
# Set up the entrypoint for the container
WORKDIR /opt/program
# SageMaker starts the container as "docker run <image> serve"; the shell-form
# ENTRYPOINT ignores that trailing argument, so gunicorn starts cleanly.
ENTRYPOINT gunicorn --bind 0.0.0.0:8080 wsgi:app
code/requirements.txt:
flask
gunicorn
transformers
numpy
pycuda
# torch is needed by transformers for some utilities; the extra index below lets pip
# resolve the CPU-only build to keep the image small
--extra-index-url https://download.pytorch.org/whl/cpu
torch
code/wsgi.py:
from app import app
if __name__ == "__main__":
    app.run()
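Once you have built the image locally (the build and push commands appear in Step 3), it's worth smoke-testing the container before deploying—for example, started with something like `docker run --gpus all -p 8080:8080 sagemaker-tensorrt-llm-inference:latest`. A small client sketch using the `requests` library (not part of the container's dependencies):

# smoke_test_container.py -- hit the locally running container (illustrative)
import requests

BASE_URL = "http://localhost:8080"

# Health check endpoint used by SageMaker
assert requests.get(f"{BASE_URL}/ping", timeout=5).status_code == 200

# Same JSON contract that SageMaker will forward to /invocations
payload = {"inputs": ["TensorRT is incredibly fast.", "This movie was not good."]}
resp = requests.post(f"{BASE_URL}/invocations", json=payload, timeout=30)
print(resp.status_code, resp.json())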
Step 3: Build, Push, and Deploy
Finally, we use the SageMaker Python SDK to manage the deployment.
# deploy_sagemaker.py
import tarfile

import sagemaker
import boto3
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# 1. Setup session and roles
sess = sagemaker.Session()
boto_session = boto3.Session()
region = boto_session.region_name
role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity().get('Account')

# 2. Define ECR repository
image_name = 'sagemaker-tensorrt-llm-inference'
tag = 'latest'
repository_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}'

# --- Run these commands in your terminal first! ---
# aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
# docker build -t {image_name}:{tag} .
# docker tag {image_name}:{tag} {repository_uri}
# docker push {repository_uri}
print(f"Make sure you have built and pushed the image to: {repository_uri}")

# 3. Package and upload the model artifact to S3
# SageMaker expects model_data to be a gzipped tarball; its contents are
# extracted into /opt/ml/model inside the container at startup.
with tarfile.open('model.tar.gz', 'w:gz') as archive:
    archive.add('model.engine')
model_artifact = sess.upload_data(path='model.tar.gz', key_prefix='tensorrt-llm/model')
print(f"Model artifact uploaded to: {model_artifact}")

# 4. Create SageMaker Model
trt_model = Model(
    image_uri=repository_uri,
    model_data=model_artifact,
    role=role,
    sagemaker_session=sess,
)

# 5. Deploy to an endpoint
# The engine must run on the same GPU architecture it was built for;
# ml.g5 instances use the Ampere-class A10G.
instance_type = 'ml.g5.xlarge'
endpoint_name = 'tensorrt-distilbert-endpoint'

predictor = trt_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(f"Endpoint {endpoint_name} deployed successfully.")

# 6. Test the endpoint
try:
    payload = {'inputs': ['TensorRT is incredibly fast.', 'This movie was not good.']}
    response = predictor.predict(payload)
    print("\nInference Response:")
    print(response)
finally:
    # 7. Clean up when you are done (commented out to keep the endpoint running)
    # predictor.delete_endpoint()
    pass
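Once the endpoint is live, production clients typically call it through the low-level runtime API rather than the SageMaker Python SDK. A minimal example with boto3, using the endpoint name from the script above:

# invoke_endpoint.py -- call the deployed endpoint with boto3 (illustrative)
import json

import boto3

runtime = boto3.client('sagemaker-runtime')

payload = {'inputs': ['The latency on this endpoint is fantastic.']}
response = runtime.invoke_endpoint(
    EndpointName='tensorrt-distilbert-endpoint',
    ContentType='application/json',
    Body=json.dumps(payload),
)
print(json.loads(response['Body'].read()))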
Performance Benchmarks and Edge Case Handling
Theory is one thing, but production performance is what matters. A proper benchmark compares the end-to-end latency and throughput of the different deployment strategies.
Sample Benchmark Results (on ml.g5.xlarge with distilbert-base-uncased-finetuned-sst-2-english):

| Deployment Strategy | Average Latency (ms) | P95 Latency (ms) | Throughput (req/sec) |
|---|---|---|---|
| PyTorch (native on SageMaker) | 45 | 62 | ~22 |
| TensorRT FP16 Engine | 18 | 25 | ~55 |
| TensorRT INT8 Engine | 8 | 12 | ~125 |
As the results show, the TensorRT INT8 engine cuts average latency by roughly 5-6x and delivers more than 5x the throughput of the baseline PyTorch deployment. The P95 latency is particularly important, as it represents the worst-case experience most users will notice.
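Numbers like these are only meaningful if you measure them against your own endpoint and traffic pattern. The sketch below is a crude sequential harness using the boto3 client and endpoint name from earlier; it measures end-to-end client latency including network overhead, so use a proper load-testing tool for throughput figures.

# benchmark_endpoint.py -- crude sequential latency benchmark (illustrative)
import json
import time

import boto3
import numpy as np

runtime = boto3.client('sagemaker-runtime')
ENDPOINT_NAME = 'tensorrt-distilbert-endpoint'
payload = json.dumps({'inputs': ['The latency on this endpoint is fantastic.']})

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=payload,
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Drop the first few calls to exclude connection setup effects
warm = np.array(latencies_ms[10:])
print(f"avg: {warm.mean():.1f} ms, p50: {np.percentile(warm, 50):.1f} ms, "
      f"p95: {np.percentile(warm, 95):.1f} ms, p99: {np.percentile(warm, 99):.1f} ms")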
Handling Advanced Edge Cases
* Accuracy Degradation with INT8: If you observe a significant drop in accuracy after INT8 quantization, your first step is to analyze your calibration dataset. Is it truly representative? If improving the data doesn't help, the next step is Quantization-Aware Training (QAT). QAT simulates the effects of quantization during the fine-tuning process, allowing the model to adapt its weights to minimize the quantization error. This is more complex but often recovers most of the lost accuracy.
* Engine Portability and CI/CD: A TensorRT `.engine` file is *not* portable. It is compiled specifically for a TensorRT version and a GPU architecture (e.g., Turing, Ampere, Hopper). In a production MLOps pipeline, your CI/CD system must build a separate engine for each target deployment environment, and your build artifacts should be tagged with the GPU architecture and TensorRT version (e.g., model-ampere-trt23.10.engine).
* Cold Starts: Even with a highly optimized engine, the first invocations on a newly provisioned instance (at deployment time or when auto scaling adds capacity) are slow due to container startup and engine deserialization. For consistent low latency, keep a minimum instance count sized above your baseline traffic so warm capacity is always available (at the cost of paying for idle instances), keep the container image lean so new instances come up quickly, and load the engine once at container startup—as the Flask app above does—rather than lazily on the first request.
* Unsupported Operators: When `parser.parse(model.read())` fails, it's often due to an unsupported operator. The primary solution is to rewrite the problematic section of the PyTorch model using a sequence of simpler, supported operators. If that's not possible, the fallback is to implement a custom TensorRT plugin using the C++ API. This involves writing your own CUDA kernel for the operation, which is a significant undertaking reserved for cases where no other workaround exists.
Conclusion
Optimizing LLM inference latency is a non-trivial but essential task for building production-ready AI applications. Moving from a general-purpose framework like PyTorch to a specialized inference optimizer like TensorRT unlocks performance gains that are simply unattainable otherwise. By mastering the workflow of ONNX conversion, advanced INT8 quantization with custom calibration, and deployment via custom SageMaker containers, engineering teams can deliver the low-latency, high-throughput experiences that modern applications demand. This process requires a deep understanding of the entire stack, from model architecture to GPU hardware, but the resulting performance improvement justifies the investment for any serious MLOps practice.