Optimizing LLM Inference Latency with TensorRT on AWS SageMaker

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production Inference Wall: When Native Frameworks Aren't Enough

In the lifecycle of a production machine learning system, particularly one involving Large Language Models (LLMs), the transition from training to serving marks a fundamental shift in priorities. While training is a compute-intensive, offline process focused on model accuracy, inference is a latency-sensitive, online operation where every millisecond counts. For applications like real-time chatbots, semantic search, or content generation APIs, high latency directly translates to poor user experience and non-viable products.

A common initial deployment strategy involves serving a native PyTorch or TensorFlow model on a GPU-enabled instance using a framework like TorchServe or TensorFlow Serving. While functional, this approach often hits a performance wall. These frameworks are general-purpose and do not perform the deep, hardware-specific optimizations required to squeeze maximum performance from NVIDIA GPUs. The model graph is executed operator-by-operator, incurring significant kernel launch overhead and failing to leverage opportunities for layer fusion or reduced-precision arithmetic.

This is the precise problem NVIDIA's TensorRT is designed to solve. It's not just a serving framework; it's a high-performance deep learning inference optimizer and runtime. TensorRT takes a trained model and reconstructs it into a highly optimized engine specifically tuned for a target GPU architecture. Its core optimization strategies include:

  • Graph and Layer Fusion: TensorRT parses the model graph and identifies sequences of operations (e.g., convolution -> bias -> ReLU) that can be merged into a single, custom CUDA kernel. This dramatically reduces kernel launch overhead and memory bandwidth requirements.
  • Precision Calibration: It supports lower-precision inference (FP16, INT8) which significantly increases throughput. For INT8, it employs a sophisticated calibration process to determine quantization parameters that minimize accuracy loss.
  • Kernel Auto-Tuning: TensorRT maintains a repository of highly optimized kernels for various operations, parameters, and target GPUs. During the engine-building phase, it benchmarks multiple kernel implementations and selects the fastest one for the specific context.
  • Dynamic Tensor Memory: It optimizes memory allocation and deallocation to reduce overhead during runtime.

This article is a practical guide for senior engineers on implementing these optimizations. We will bypass introductory concepts and focus on the advanced, production-critical steps: building a custom INT8-quantized TensorRT engine for a Transformer model and deploying it within a custom container on AWS SageMaker for scalable, low-latency inference.


    The Optimization Workflow: From Hugging Face to TensorRT Engine

    Our goal is to take a pre-trained model from the Hugging Face ecosystem—let's use distilbert-base-uncased for a manageable example, though the principles apply directly to larger models like Llama or T5—and convert it into a deployable TensorRT engine. The path involves a critical intermediate step: ONNX (Open Neural Network Exchange).

    Step 1: Exporting to ONNX with Dynamic Axes

    ONNX serves as the bridge between the training framework (PyTorch) and the inference optimizer (TensorRT). Exporting a model to ONNX requires careful attention to dynamic input shapes, a non-negotiable requirement for most NLP tasks where sequence length varies.

    Here's a Python script to perform the export. Note the dynamic_axes argument—this is crucial. It informs the ONNX graph that the batch_size and sequence_length dimensions are not fixed, allowing TensorRT to later build an engine that can handle variable-sized inputs.

    python
    # export_to_onnx.py
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    import os
    
    def export_model_to_onnx(
        model_name: str,
        output_dir: str,
        output_filename: str = "model.onnx"
    ):
        """Exports a Hugging Face model to ONNX with dynamic axes."""
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
    
        # Load tokenizer and model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)
        model.eval() # Set model to evaluation mode
    
        # Define a sample input with realistic dimensions
        # We use a batch size of 1 and a sequence length of 128 for the export trace
        # These will be made dynamic
        sample_text = "This is a sample sentence for ONNX export."
        inputs = tokenizer(sample_text, return_tensors="pt", padding="max_length", max_length=128, truncation=True)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
    
        output_path = os.path.join(output_dir, output_filename)
    
        print(f"Exporting model to {output_path}...")
    
        torch.onnx.export(
            model,
            (input_ids, attention_mask), # Model inputs
            output_path,
            input_names=["input_ids", "attention_mask"], # Input names
            output_names=["logits"], # Output names
            dynamic_axes={
                "input_ids": {0: "batch_size", 1: "sequence_length"},
                "attention_mask": {0: "batch_size", 1: "sequence_length"},
                "logits": {0: "batch_size"}
            },
            opset_version=13, # A commonly supported opset version
            do_constant_folding=True
        )
    
        print("ONNX export complete.")
        print(f"Model saved to {output_path}")
    
    if __name__ == "__main__":
        MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
        OUTPUT_DIR = "./models/distilbert_onnx"
        export_model_to_onnx(MODEL_NAME, OUTPUT_DIR)

    Production Considerations:

    * opset_version: This is critical. Newer versions support more operators but may not be fully compatible with your TensorRT version. opset_version=13 or 14 is often a safe bet. You may need to experiment.

    * Unsupported Operators: For more complex models, you might encounter PyTorch operators that don't have a direct ONNX equivalent. This often requires model surgery: rewriting parts of your model using supported operations, or implementing a custom TensorRT plugin, which is a significantly more advanced topic. Before involving TensorRT at all, it is worth validating the exported graph on its own, as sketched below.
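
    A quick validation pass, assuming the onnx and onnxruntime packages are available, confirms that the exported graph is well-formed and that the dynamic axes were actually recorded. It won't predict every TensorRT parsing issue, but it rules out problems with the export itself.

    python
    # verify_onnx.py -- sanity-check the exported graph before involving TensorRT
    import numpy as np
    import onnx
    import onnxruntime as ort

    ONNX_PATH = "./models/distilbert_onnx/model.onnx"

    # Structural validation catches malformed graphs early
    onnx.checker.check_model(onnx.load(ONNX_PATH))

    # Run at a batch size and sequence length different from the export trace
    # to confirm the dynamic axes behave as expected
    session = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
    dummy_ids = np.random.randint(0, 1000, size=(2, 64), dtype=np.int64)
    dummy_mask = np.ones((2, 64), dtype=np.int64)
    logits = session.run(["logits"], {"input_ids": dummy_ids, "attention_mask": dummy_mask})[0]
    print("Output shape:", logits.shape)  # Expect (2, num_classes)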


    Advanced Optimization: INT8 Quantization with a Custom Calibrator

    With our ONNX model in hand, we can now build the TensorRT engine. While an FP16 engine offers a good balance of performance and simplicity, INT8 quantization can provide a 2-3x throughput improvement, which is often a game-changer. However, this comes at the cost of potential accuracy degradation. To mitigate this, we use a calibration process.

    Post-Training Quantization (PTQ) involves running the model on a small, representative sample of data to observe the activation distributions for each layer. TensorRT uses this information to calculate optimal scaling factors for quantizing floating-point weights and activations to 8-bit integers.
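
    To make the idea concrete, here is a purely conceptual sketch of symmetric INT8 quantization. It is not TensorRT's internal implementation (the entropy calibrator chooses the clipping threshold by minimizing the KL divergence between the FP32 and quantized distributions), but it shows the role the calibration-derived scale factor plays.

    python
    # Conceptual sketch only -- not TensorRT's internal algorithm
    import numpy as np

    activations = np.random.randn(10_000).astype(np.float32)  # observed FP32 activations
    threshold = np.percentile(np.abs(activations), 99.9)      # clipping range chosen by "calibration"
    scale = threshold / 127.0                                 # one scale factor per tensor

    quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
    dequantized = quantized.astype(np.float32) * scale        # what the INT8 engine effectively computes

    print("max abs quantization error:", np.max(np.abs(activations - dequantized)))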

    Implementing a custom calibrator gives us full control over this process.

    The `IInt8EntropyCalibrator2` Implementation

    We'll create a class that inherits from TensorRT's IInt8EntropyCalibrator2. This class is responsible for providing batches of calibration data to TensorRT.

    python
    # build_engine.py
    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit # Important for initializing CUDA context
    import numpy as np
    import os
    
    # A custom calibrator class for INT8 quantization
    class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
        def __init__(self, calibration_data, cache_file, batch_size=8):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.cache_file = cache_file
            self.data = calibration_data # Expects a list of dicts like {'input_ids': ..., 'attention_mask': ...}
            self.batch_size = batch_size
            self.current_index = 0
    
            # Allocate GPU memory for inputs
            self.device_inputs = []
            for inp in self.data[0].values():
                size = trt.volume(inp.shape) * inp.dtype.itemsize * self.batch_size
                self.device_inputs.append(cuda.mem_alloc(size))
    
            # Create a generator for calibration batches
            self.batches = self.create_batches()
    
        def create_batches(self):
            for i in range(0, len(self.data), self.batch_size):
                batch_data = self.data[i:i + self.batch_size]
                if len(batch_data) < self.batch_size:
                    # Pad the last batch if necessary
                    padding_needed = self.batch_size - len(batch_data)
                    batch_data.extend(self.data[:padding_needed])
                
                # Collate batch
                input_ids_batch = np.ascontiguousarray(np.array([item['input_ids'] for item in batch_data]))
                attention_mask_batch = np.ascontiguousarray(np.array([item['attention_mask'] for item in batch_data]))
                yield [input_ids_batch, attention_mask_batch]
    
        def get_batch_size(self):
            return self.batch_size
    
        def get_batch(self, names):
            try:
                batch = next(self.batches)
                # Copy each input array to its pre-allocated GPU buffer
                for i, data in enumerate(batch):
                    cuda.memcpy_htod(self.device_inputs[i], data)
                # TensorRT expects a list of device pointers as ints
                return [int(d) for d in self.device_inputs]
            except StopIteration:
                # Returning None tells TensorRT the calibration data is exhausted
                return None
    
        def read_calibration_cache(self):
            if os.path.exists(self.cache_file):
                with open(self.cache_file, "rb") as f:
                    return f.read()
    
        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)
    
    # --- Main Engine Building Logic ---
    
    def build_int8_engine(onnx_path, engine_path, calibration_data):
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser = trt.OnnxParser(network, TRT_LOGGER)
    
        # Parse the ONNX model
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                raise ValueError("Failed to parse the ONNX file.")
    
        config = builder.create_builder_config()
        # max_workspace_size is deprecated in recent TensorRT releases; use the memory pool API
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
    
        # --- This is the key part for INT8 and Dynamic Shapes ---
        config.set_flag(trt.BuilderFlag.INT8)
        calibrator = EntropyCalibrator(calibration_data, "./calibration.cache", batch_size=8)
        config.int8_calibrator = calibrator
    
        # --- Define Optimization Profile for Dynamic Shapes ---
        profile = builder.create_optimization_profile()
        # Input 0: input_ids, Input 1: attention_mask
        profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(16, 512))
        profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(16, 512))
        config.add_optimization_profile(profile)
        # With dynamic shapes, TensorRT also needs a profile to run the calibrator against
        config.set_calibration_profile(profile)
    
        print("Building TensorRT engine... (This may take a while)")
        serialized_engine = builder.build_serialized_network(network, config)
        
        if serialized_engine is None:
            raise RuntimeError("Failed to build the TensorRT engine.")
    
        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)
        print(f"Engine saved to {engine_path}")
    
    if __name__ == "__main__":
        # You must create a representative calibration dataset
        # For this example, we'll generate dummy data. In production, use real data samples.
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
        calibration_texts = ["this is a great movie", "I really disliked the plot", "the acting was superb"] * 50
        
        calibration_data = []
        for text in calibration_texts:
            inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=128, truncation=True)
            calibration_data.append({
                'input_ids': inputs['input_ids'].astype(np.int32),
                'attention_mask': inputs['attention_mask'].astype(np.int32)
            })
    
        ONNX_PATH = "./models/distilbert_onnx/model.onnx"
        ENGINE_PATH = "./models/distilbert_trt/model.engine"
        os.makedirs(os.path.dirname(ENGINE_PATH), exist_ok=True)
        
        build_int8_engine(ONNX_PATH, ENGINE_PATH, calibration_data)

    Key Implementation Details:

    * Calibration Data Quality: The most critical factor for INT8 accuracy is the quality of your calibration dataset. It must be representative of the data the model will see in production. If your production data is diverse, your calibration set must also be diverse. A few hundred samples are usually sufficient.

    * Optimization Profiles: The set_shape call is how we handle dynamic inputs. We define a min, opt (optimal), and max shape for each dynamic input. TensorRT tunes kernels specifically for the opt shape, but the resulting engine can handle any input dimensions within the min and max bounds. Defining this range too broadly can cost some performance, so it's a trade-off between flexibility and speed; one mitigation, sketched after this list, is to register multiple profiles.

    * Calibration Cache: The read/write_calibration_cache methods are a crucial optimization. The calibration process can be slow. By caching the results, subsequent engine builds (e.g., in a CI/CD pipeline) can skip the calibration step if the network and calibration data haven't changed.
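
    When a single min/max range is stretched to cover both single-request latency and large-batch throughput, registering multiple optimization profiles on the same engine and selecting the appropriate one at runtime is a common mitigation. A minimal sketch of the build-time side, shown as an alternative to the single profile used above (it assumes the builder and config objects from build_engine.py):

    python
    # Sketch: registering two profiles instead of the single one used above
    low_latency = builder.create_optimization_profile()
    low_latency.set_shape("input_ids", min=(1, 16), opt=(1, 128), max=(1, 512))
    low_latency.set_shape("attention_mask", min=(1, 16), opt=(1, 128), max=(1, 512))
    config.add_optimization_profile(low_latency)

    batched = builder.create_optimization_profile()
    batched.set_shape("input_ids", min=(1, 16), opt=(16, 128), max=(32, 512))
    batched.set_shape("attention_mask", min=(1, 16), opt=(16, 128), max=(32, 512))
    config.add_optimization_profile(batched)

    # At runtime, the execution context selects a profile before setting input shapes,
    # e.g. context.set_optimization_profile_async(profile_index, stream.handle)

    Be aware that each additional profile adds its own set of bindings to the engine, so the serving code's buffer bookkeeping has to account for the active profile.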


    Production Deployment: A Custom SageMaker Container

    AWS SageMaker is a powerful platform for deploying ML models, but its default containers are designed for standard frameworks. To run a custom TensorRT engine, we must build our own Docker container that adheres to the SageMaker Runtime API.

    This involves:

    • A Dockerfile that installs all necessary dependencies (CUDA, cuDNN, TensorRT, Python libraries).
    • A web server (e.g., Flask, FastAPI) to handle inference requests.
    • Scripts to manage the container's lifecycle.

    Step 1: The Inference Server

    We'll use Flask to create a simple server. SageMaker expects the container to respond to POST requests on /invocations for inference and GET requests on /ping for health checks.

    Here's the structure of our deployment directory:

    text
    /sagemaker_deployment
    ├── model.engine         # Our compiled TensorRT engine
    ├── Dockerfile
    └── code/
        ├── app.py           # The Flask server
        ├── requirements.txt # Python dependencies
        └── wsgi.py          # Gunicorn entrypoint

    code/app.py:

    python
    # code/app.py
    import os
    import json
    import numpy as np
    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    from flask import Flask, request, Response
    
    app = Flask(__name__)
    
    # --- TensorRT Inference Class ---
    class TRTInference:
        def __init__(self, engine_path):
            self.logger = trt.Logger(trt.Logger.WARNING)
            self.runtime = trt.Runtime(self.logger)
            
            with open(engine_path, 'rb') as f:
                self.engine = self.runtime.deserialize_cuda_engine(f.read())
            
            self.context = self.engine.create_execution_context()
            self.inputs, self.outputs, self.bindings, self.stream = [], [], [], cuda.Stream()
    
            # The engine was built with explicit-batch dynamic shapes, so binding shapes
            # contain -1. Allocate host/device buffers for the maximum shapes allowed by
            # optimization profile 0 instead of the (unresolved) binding shapes.
            max_batch = self.engine.get_profile_shape(0, 0)[2][0]
            for binding in self.engine:
                if self.engine.binding_is_input(binding):
                    shape = self.engine.get_profile_shape(0, binding)[2]  # profile max shape
                else:
                    shape = tuple(max_batch if dim == -1 else dim
                                  for dim in self.engine.get_binding_shape(binding))
                size = trt.volume(shape)
                dtype = trt.nptype(self.engine.get_binding_dtype(binding))
                host_mem = cuda.pagelocked_empty(size, dtype)
                device_mem = cuda.mem_alloc(host_mem.nbytes)
                self.bindings.append(int(device_mem))
                if self.engine.binding_is_input(binding):
                    self.inputs.append({'host': host_mem, 'device': device_mem})
                else:
                    self.outputs.append({'host': host_mem, 'device': device_mem})
    
        def infer(self, input_ids, attention_mask):
            # Set context shape for dynamic inputs
            self.context.set_binding_shape(0, input_ids.shape)
            self.context.set_binding_shape(1, attention_mask.shape)
    
            # Copy data into the (oversized) pagelocked host buffers
            self.inputs[0]['host'][:input_ids.size] = input_ids.ravel()
            self.inputs[1]['host'][:attention_mask.size] = attention_mask.ravel()
    
            # Transfer input data to GPU
            for inp in self.inputs:
                cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
    
            # Run inference
            self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
    
            # Transfer predictions back from GPU
            for out in self.outputs:
                cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
    
            # Synchronize stream
            self.stream.synchronize()
    
            # The host buffer is sized for the profile max, so slice out the valid region
            # before reshaping. For sequence classification the output is
            # (batch_size, num_classes).
            num_classes = 2  # Example for SST-2
            batch_size = input_ids.shape[0]
            output_data = self.outputs[0]['host'][:batch_size * num_classes]
            return output_data.reshape((batch_size, num_classes))
    
    # --- Load model and tokenizer ---
    # In a real app, tokenizer files would be packaged in the container
    from transformers import AutoTokenizer
    
    MODEL_PATH = '/opt/ml/model/model.engine'
    TOKENIZER_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
    
    trt_model = TRTInference(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    
    @app.route('/ping', methods=['GET'])
    def ping():
        return Response(response='\n', status=200, mimetype='application/json')
    
    @app.route('/invocations', methods=['POST'])
    def invocations():
        data = request.get_json(force=True)
        inputs_text = data.get('inputs', [])
        if not inputs_text:
            return Response(response=json.dumps({'error': 'Missing "inputs" key'}), status=400, mimetype='application/json')
    
        # Preprocess
        tokens = tokenizer(inputs_text, return_tensors="np", padding=True, truncation=True, max_length=512)
        input_ids = tokens['input_ids'].astype(np.int32)
        attention_mask = tokens['attention_mask'].astype(np.int32)
    
        # Inference
        logits = trt_model.infer(input_ids, attention_mask)
    
        # Postprocess (softmax and get predictions)
        probabilities = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
        predictions = np.argmax(probabilities, axis=1).tolist()
        scores = np.max(probabilities, axis=1).tolist()
    
        result = [{'label': pred, 'score': score} for pred, score in zip(predictions, scores)]
        return Response(response=json.dumps(result), status=200, mimetype='application/json')

    Step 2: The Dockerfile

    This Dockerfile starts from the official NVIDIA TensorRT container, which already ships with compatible CUDA, cuDNN, and TensorRT libraries, so we only add the serving dependencies and our code on top.

    Dockerfile:

    dockerfile
    # Use the official TensorRT container as the base
    # Choose a version compatible with your build environment and GPU drivers
    FROM nvcr.io/nvidia/tensorrt:23.10-py3
    
    # Set up environment
    ENV PYTHONDONTWRITEBYTECODE=1
    ENV PYTHONUNBUFFERED=1
    
    # Install system dependencies
    RUN apt-get update && apt-get install -y --no-install-recommends \
        nginx \
        ca-certificates \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Python dependencies
    COPY code/requirements.txt /tmp/requirements.txt
    RUN pip install --no-cache-dir -r /tmp/requirements.txt
    
    # Copy the inference code and model
    # SageMaker expects the model artifacts in /opt/ml/model
    COPY model.engine /opt/ml/model/model.engine
    COPY code /opt/program
    
    # Set up the entrypoint for the container
    WORKDIR /opt/program
    ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:8080", "wsgi:app"]

    code/requirements.txt:

    text
    flask
    gunicorn
    transformers
    numpy
    pycuda
    # torch is needed by transformers for some utilities; pull CPU-only wheels to save space
    --extra-index-url https://download.pytorch.org/whl/cpu
    torch

    code/wsgi.py:

    python
    from app import app
    
    if __name__ == "__main__":
        app.run()
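
    Before pushing anything to ECR, it is worth a local smoke test. Assuming you have built the image from this Dockerfile (e.g. docker build -t trt-serving-test . on a GPU machine) and started it with docker run --gpus all -p 8080:8080 trt-serving-test, a minimal client looks like this:

    python
    # local_test.py -- smoke test against a locally running container
    import requests

    BASE_URL = "http://localhost:8080"

    # Health check, mirroring SageMaker's /ping probe
    resp = requests.get(f"{BASE_URL}/ping")
    print("ping:", resp.status_code)

    # Inference request, mirroring SageMaker's /invocations call
    payload = {"inputs": ["TensorRT is incredibly fast.", "This movie was not good."]}
    resp = requests.post(f"{BASE_URL}/invocations", json=payload)
    print("invocations:", resp.status_code, resp.json())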

    Step 3: Build, Push, and Deploy

    Finally, we use the SageMaker Python SDK to manage the deployment.

    python
    # deploy_sagemaker.py
    import tarfile
    import sagemaker
    import boto3
    from sagemaker.model import Model
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer
    
    # 1. Setup session and roles
    sess = sagemaker.Session()
    boto_session = boto3.Session()
    region = boto_session.region_name
    role = sagemaker.get_execution_role()
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    
    # 2. Define ECR repository
    image_name = 'sagemaker-tensorrt-llm-inference'
    tag = 'latest'
    repository_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}'
    
    # --- Run this command in your terminal first! ---
    # aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
    # docker build -t {image_name}:{tag} .
    # docker tag {image_name}:{tag} {repository_uri}
    # docker push {repository_uri}
    print(f"Make sure you have built and pushed the image to: {repository_uri}")
    
    # 3. Package and upload the model artifact to S3
    # SageMaker expects model_data to be a gzipped tarball; it is extracted into /opt/ml/model
    with tarfile.open('model.tar.gz', 'w:gz') as tar:
        tar.add('model.engine')
    model_artifact = sess.upload_data(path='model.tar.gz', key_prefix='tensorrt-llm/model')
    print(f"Model artifact uploaded to: {model_artifact}")
    
    # 4. Create SageMaker Model
    trt_model = Model(
        image_uri=repository_uri,
        model_data=model_artifact,
        role=role,
        sagemaker_session=sess,
    )
    
    # 5. Deploy to an endpoint
    # Choose an instance with a compatible GPU (e.g., Ampere architecture for TensorRT 8+)
    instance_type = 'ml.g5.xlarge'
    endpoint_name = 'tensorrt-distilbert-endpoint'
    
    predictor = trt_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
    
    print(f"Endpoint {endpoint_name} deployed successfully.")
    
    # 6. Test the endpoint
    try:
        payload = {'inputs': ['TensorRT is incredibly fast.', 'This movie was not good.']}
        response = predictor.predict(payload)
        print("\nInference Response:")
        print(response)
    finally:
        # 7. Clean up
        # predictor.delete_endpoint()
        pass

    Performance Benchmarks and Edge Case Handling

    Theory is one thing, but production performance is what matters. A proper benchmark compares the end-to-end latency and throughput of the different deployment strategies.
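
    The sketch below is one way to collect those numbers against the deployed endpoint using boto3. It measures client-observed latency, so network time is included; run it from the same region as the endpoint. The endpoint name is the one used in the deployment script above.

    python
    # benchmark_endpoint.py -- client-side latency measurement (includes network overhead)
    import json
    import time

    import boto3
    import numpy as np

    ENDPOINT_NAME = "tensorrt-distilbert-endpoint"  # from deploy_sagemaker.py above
    runtime = boto3.client("sagemaker-runtime")
    payload = json.dumps({"inputs": ["TensorRT is incredibly fast."]})

    latencies_ms = []
    for _ in range(200):
        start = time.perf_counter()
        runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=payload,
        )
        latencies_ms.append((time.perf_counter() - start) * 1000)

    print(f"avg latency: {np.mean(latencies_ms):.1f} ms")
    print(f"p95 latency: {np.percentile(latencies_ms, 95):.1f} ms")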

    Sample Benchmark Results (on ml.g5.xlarge with distilbert-base-uncased-sst-2):

    Deployment Strategy             Average Latency (ms)   P95 Latency (ms)   Throughput (req/sec)
    PyTorch (native on SageMaker)   45                     62                 ~22
    TensorRT FP16 Engine            18                     25                 ~55
    TensorRT INT8 Engine            8                      12                 ~125

    As the results show, the TensorRT INT8 engine cuts average latency from 45 ms to 8 ms (roughly a 5-6x improvement) and delivers more than five times the throughput of the baseline PyTorch deployment. The P95 latency is particularly important, as it represents the worst-case user experience.

    Handling Advanced Edge Cases

    * Accuracy Degradation with INT8: If you observe a significant drop in accuracy after INT8 quantization, your first step is to analyze your calibration dataset. Is it truly representative? If improving the data doesn't help, the next step is Quantization-Aware Training (QAT). QAT simulates the effects of quantization during the fine-tuning process, allowing the model to adapt its weights to minimize the quantization error. This is more complex but often recovers most of the lost accuracy.

    * Engine Portability and CI/CD: A TensorRT .engine file is not portable. It is compiled specifically for a TensorRT version and a GPU architecture (e.g., Turing, Ampere, Hopper). In a production MLOps pipeline, your CI/CD system must build a separate engine for each target deployment environment, and your build artifacts should be tagged with the GPU architecture and TRT version (e.g., model-ampere-trt23.10.engine); a small tagging helper is sketched after this list.

    * Cold Starts: Even with a highly optimized engine, the first invocations on a newly launched instance (at endpoint creation or during scale-out) can be slow due to container startup and model loading. For real-time endpoints, keep a minimum instance count behind your auto scaling policy so traffic spikes never wait on a fresh instance. If you use SageMaker Serverless Inference instead, its Provisioned Concurrency feature keeps a specified number of workers pre-warmed, effectively eliminating cold starts at the cost of paying for idle capacity.

    * Unsupported Operators: When parser.parse(model.read()) fails, it's often due to an unsupported operator. The primary solution is to rewrite the problematic section of the PyTorch model using a sequence of simpler, supported operators. If that's not possible, the fallback is to implement a custom TensorRT plugin using the C++ API. This involves writing your own CUDA kernel for the operation, which is a significant undertaking reserved for cases where no other workaround exists.
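
    For the artifact-tagging convention mentioned above, both identifiers can be read programmatically in the build environment; the naming scheme below is just one possibility.

    python
    # tag_engine.py -- derive an engine artifact tag from the build environment
    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit

    major, minor = cuda.Device(0).compute_capability()  # e.g. (8, 6) on an A10G (ml.g5)
    tag = f"model-sm{major}{minor}-trt{trt.__version__}.engine"
    print(tag)  # e.g. model-sm86-trt8.6.1.engine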

    Conclusion

    Optimizing LLM inference latency is a non-trivial but essential task for building production-ready AI applications. Moving from a general-purpose framework like PyTorch to a specialized inference optimizer like TensorRT unlocks performance gains that are simply unattainable otherwise. By mastering the workflow of ONNX conversion, advanced INT8 quantization with custom calibration, and deployment via custom SageMaker containers, engineering teams can deliver the low-latency, high-throughput experiences that modern applications demand. This process requires a deep understanding of the entire stack, from model architecture to GPU hardware, but the resulting performance improvement justifies the investment for any serious MLOps practice.
