INT8 Post-Training Quantization for LLM Inference on NVIDIA GPUs
The Inference Bottleneck: Moving Beyond FP16
In the realm of large-scale LLM deployment, the transition from training to production inference introduces a formidable set of engineering challenges. While training is a capital-intensive, offline process, inference is an operational expenditure that directly impacts user experience through latency and the bottom line through hardware costs. The standard half-precision formats, FP16 and BF16, while a significant improvement over FP32, still demand substantial VRAM and memory bandwidth, making the deployment of multi-billion parameter models on cost-effective hardware a non-trivial pursuit.
Enter Post-Training Quantization (PTQ). Unlike Quantization-Aware Training (QAT), which requires a costly and complex full retraining cycle to simulate quantization effects, PTQ offers a pragmatic path to optimization. It converts a pre-trained model's weights and activations from a floating-point representation to a lower-precision integer format, typically INT8. This conversion can yield a theoretical 2x reduction in model size and a significant speedup in computation on hardware with specialized INT8 tensor cores, like those found in NVIDIA's Turing, Ampere, and subsequent GPU architectures.
However, this is not a "fire and forget" optimization. A naive PTQ implementation can lead to catastrophic accuracy degradation. The core of successful PTQ lies in a meticulous process called calibration, which determines the optimal mapping from the floating-point domain to the limited integer range. This article dissects the advanced techniques for implementing robust INT8 PTQ for Transformer-based models, focusing on NVIDIA's TensorRT as the inference optimization framework.
The Heart of PTQ: Entropy-Based Calibration
The fundamental challenge in quantization is representing a wide-ranging, continuous distribution of floating-point values (activations) with a mere 256 discrete integer values. The method used to determine the scaling factor S in the affine transformation FP32_value ≈ S * (INT8_value - Z) (where Z is the zero-point) is paramount.
1. Min-Max Calibration: The Brittle Baseline
The simplest approach is to observe the absolute minimum and maximum activation values across a calibration dataset and map this range directly to [-127, 127].
S = max(|min_val|, |max_val|) / 127
This method is fast but extremely sensitive to outliers. A single anomalous activation value can drastically expand the represented range, forcing the vast majority of typical values into a tiny fraction of the available integer buckets. This leads to a significant loss of precision and, consequently, a drop in model accuracy.
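To see the failure mode concretely, here is a minimal NumPy sketch on synthetic data (not from the article's model) showing how a single outlier stretches the scale and wastes most of the INT8 range:

import numpy as np

# Synthetic activations: a narrow Gaussian plus one large outlier
acts = np.concatenate([np.random.normal(0.0, 1.0, 10_000), [80.0]])

scale = np.max(np.abs(acts)) / 127.0                       # min-max / max-abs calibration
quantized = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)

# The outlier inflates the scale, so the bulk of values collapse into a handful of bins
print(f"scale = {scale:.4f}")
print(f"integer levels used by the non-outlier values: {np.unique(quantized[:-1]).size} / 255")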
2. Entropy Calibration: Minimizing Information Loss
A more robust and production-ready approach is Entropy Calibration, often implemented using Kullback-Leibler (KL) divergence. The goal is no longer to capture the full range of activations but to find a clipping threshold |T| that minimizes the information loss between the original FP32 distribution and the subsequent quantized INT8 distribution.
Here's the process:
- For a given activation tensor, collect a histogram of its values from the calibration dataset.
- Iterate over candidate clipping thresholds T, starting from the max value and moving inwards.
- For each candidate T, create a temporary quantized distribution: values outside [-T, T] are saturated (clipped), and values inside are distributed across the 256 integer bins.
- Calculate the KL divergence between the original distribution (normalized, with clipped values redistributed) and this new quantized distribution.
- The optimal T is the one that minimizes the KL divergence. This T is then used to calculate the scaling factor: S = T / 127.
This method effectively ignores rare outliers, preserving the resolution for the bulk of the activation distribution where it matters most. It is the default and recommended method in TensorRT's IInt8EntropyCalibrator2, and a simplified version is sketched below.
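Here is a simplified NumPy sketch of the threshold search, assuming 2048 histogram bins and 128 quantization levels; it mirrors the scheme described above in spirit, but TensorRT's internal implementation differs in detail:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = p / p.sum()
    q = q / max(q.sum(), eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def entropy_calibrate(activations, num_bins=2048, num_levels=128):
    # Histogram of absolute values; we search for a symmetric clipping threshold T
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    hist = hist.astype(np.float64)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1):
        ref = hist[:i].copy()
        ref[-1] += hist[i:].sum()                 # saturate outliers into the edge bin
        # Merge the first i bins down to 128 levels, then expand back for comparison
        cand = np.zeros_like(ref)
        for chunk in np.array_split(np.arange(i), num_levels):
            nonzero = chunk[ref[chunk] > 0]
            if nonzero.size:
                cand[nonzero] = ref[chunk].sum() / nonzero.size
        kl = kl_divergence(ref, cand)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]        # edges[i] is the candidate threshold T
    return best_t, best_t / 127.0                 # threshold T and scale S = T / 127

On a heavy-tailed activation sample this typically returns a threshold well below the observed maximum, which is exactly the outlier-clipping behavior described above.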
Production Implementation with TensorRT
Let's move from theory to a concrete implementation. We will quantize a pre-trained BERT model from the Hugging Face Hub. While smaller than a GPT-class model, the principles are identical and allow for a runnable example.
Prerequisites:
- NVIDIA GPU with Tensor Core support (e.g., T4, A100, RTX 30/40 series)
- CUDA Toolkit & cuDNN installed
- Python packages: tensorrt, pycuda, torch, transformers, datasets
Step 1: The Calibration Dataset
The quality of your calibration data is the single most important factor for successful PTQ. It must be a small but highly representative sample of the data the model will encounter in production.
For our BERT example, we'll use a subset of the SST-2 dataset.
import torch
from datasets import load_dataset
from transformers import BertTokenizer
# Configuration
MODEL_NAME = 'bert-base-uncased'
CALIBRATION_DATASET = 'glue'
CALIBRATION_SUBSET = 'sst2'
NUM_CALIB_SAMPLES = 500
MAX_SEQUENCE_LENGTH = 128
BATCH_SIZE = 32
def load_and_preprocess_calibration_data():
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
dataset = load_dataset(CALIBRATION_DATASET, CALIBRATION_SUBSET, split='train')
    # Select a diverse, reproducible subset (a fixed seed makes the calibration deterministic)
    dataset = dataset.shuffle(seed=42).select(range(NUM_CALIB_SAMPLES))
def preprocess(examples):
return tokenizer(examples['sentence'], padding='max_length', max_length=MAX_SEQUENCE_LENGTH, truncation=True, return_tensors="pt")
processed_dataset = dataset.map(preprocess, batched=True)
processed_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids'])
# We only need the inputs for calibration
calibration_files = []
    # Write only full batches: the calibrator reads fixed-size batches back from disk
    num_full_batches = NUM_CALIB_SAMPLES // BATCH_SIZE
    for i in range(0, num_full_batches * BATCH_SIZE, BATCH_SIZE):
batch = processed_dataset[i:i+BATCH_SIZE]
input_ids = batch['input_ids'].numpy()
attention_mask = batch['attention_mask'].numpy()
token_type_ids = batch['token_type_ids'].numpy()
# Save batch to a file for the calibrator
file_name = f'/tmp/calib_batch_{i//BATCH_SIZE}.bin'
with open(file_name, 'wb') as f:
f.write(input_ids.tobytes())
f.write(attention_mask.tobytes())
f.write(token_type_ids.tobytes())
calibration_files.append(file_name)
return calibration_files, (BATCH_SIZE, 3, MAX_SEQUENCE_LENGTH) # Shape info
# Generate and save data
# In a real scenario, you'd do this once
# calibration_files, shape_info = load_and_preprocess_calibration_data()
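Before handing these files to TensorRT, it is worth a quick round-trip check that each file really contains three int64 tensors of the expected size. A small sanity sketch, reusing the constants defined above:

import numpy as np

calibration_files, shape_info = load_and_preprocess_calibration_data()
with open(calibration_files[0], 'rb') as f:
    raw = np.frombuffer(f.read(), dtype=np.int64)
# input_ids, attention_mask and token_type_ids are packed back to back per batch
assert raw.size == 3 * BATCH_SIZE * MAX_SEQUENCE_LENGTH, "unexpected calibration file size"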
Step 2: The Entropy Calibrator Class
TensorRT requires a calibrator class that it can query for batches of data. This class needs to manage device memory allocation and data transfer.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import os
class BertEntropyCalibrator(trt.IInt8EntropyCalibrator2):
def __init__(self, calibration_files, batch_size, input_shape, cache_file="./bert_calibration.cache"):
trt.IInt8EntropyCalibrator2.__init__(self)
self.cache_file = cache_file
self.batch_size = batch_size
self.files = calibration_files
self.current_index = 0
# Input names must match the ONNX model's input names
self.input_names = ['input_ids', 'attention_mask', 'token_type_ids']
self.input_shape = input_shape
# Allocate GPU memory for a single batch
self.device_inputs = {}
for name in self.input_names:
# Assuming int64 for BERT inputs
self.device_inputs[name] = cuda.mem_alloc(self.batch_size * MAX_SEQUENCE_LENGTH * np.dtype(np.int64).itemsize)
def get_batch_size(self):
return self.batch_size
def get_batch(self, names):
if self.current_index >= len(self.files):
return None # No more batches
try:
# Load a pre-processed batch from disk
batch_file = self.files[self.current_index]
with open(batch_file, 'rb') as f:
input_ids_data = np.frombuffer(f.read(self.batch_size * MAX_SEQUENCE_LENGTH * 8), dtype=np.int64).reshape(self.batch_size, MAX_SEQUENCE_LENGTH)
attention_mask_data = np.frombuffer(f.read(self.batch_size * MAX_SEQUENCE_LENGTH * 8), dtype=np.int64).reshape(self.batch_size, MAX_SEQUENCE_LENGTH)
token_type_ids_data = np.frombuffer(f.read(self.batch_size * MAX_SEQUENCE_LENGTH * 8), dtype=np.int64).reshape(self.batch_size, MAX_SEQUENCE_LENGTH)
# H2D copy
cuda.memcpy_htod(self.device_inputs['input_ids'], input_ids_data)
cuda.memcpy_htod(self.device_inputs['attention_mask'], attention_mask_data)
cuda.memcpy_htod(self.device_inputs['token_type_ids'], token_type_ids_data)
self.current_index += 1
return [int(self.device_inputs[name]) for name in names]
except Exception as e:
print(f"Error in get_batch: {e}")
return None
def read_calibration_cache(self):
if os.path.exists(self.cache_file):
with open(self.cache_file, "rb") as f:
return f.read()
def write_calibration_cache(self, cache):
with open(self.cache_file, "wb") as f:
f.write(cache)
def free(self):
for dev_input in self.device_inputs.values():
dev_input.free()
Key Points:
- We pre-process and save data to disk to decouple data loading from the TensorRT calibration process.
- The get_batch method is the core: it loads a batch, copies it to the GPU, and returns a list of device pointers (a quick standalone smoke test follows below).
- The cache methods are crucial. Calibration is slow; caching the resulting scale-factor table saves significant time on subsequent engine builds.
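Before kicking off a full engine build, you can smoke-test the calibrator in isolation. A hypothetical check, reusing the helpers and constants defined earlier:

calibration_files, _ = load_and_preprocess_calibration_data()
calib = BertEntropyCalibrator(calibration_files, BATCH_SIZE, (MAX_SEQUENCE_LENGTH,))

assert calib.get_batch_size() == BATCH_SIZE
ptrs = calib.get_batch(['input_ids', 'attention_mask', 'token_type_ids'])
assert ptrs is not None and len(ptrs) == 3   # one device pointer per input
calib.free()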
Step 3: Building the Quantized Engine
Now we tie everything together to build the optimized engine. This involves parsing an ONNX model, configuring the builder, and providing our calibrator.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
def build_engine(onnx_path, engine_path):
# First, generate calibration data if not already done
calibration_files, shape_info = load_and_preprocess_calibration_data()
batch_size, _, seq_len = shape_info
with trt.Builder(TRT_LOGGER) as builder, \
builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
trt.OnnxParser(network, TRT_LOGGER) as parser:
# Parse the ONNX model
with open(onnx_path, 'rb') as model:
if not parser.parse(model.read()):
for error in range(parser.num_errors):
print(parser.get_error(error))
return None
print("ONNX model parsed successfully.")
# Builder configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB workspace
# --- INT8 Configuration --- #
config.set_flag(trt.BuilderFlag.INT8)
calibrator = BertEntropyCalibrator(calibration_files, batch_size, (seq_len,), cache_file="./bert_int8_calibration.cache")
config.int8_calibrator = calibrator
# --- End INT8 Configuration --- #
# Set optimization profiles for dynamic shapes if needed
profile = builder.create_optimization_profile()
# Assuming BERT inputs are int64, but TRT often prefers int32.
# Let's define profile for input_ids, attention_mask, token_type_ids
# Note: Input names must match ONNX graph
for name in ['input_ids', 'attention_mask', 'token_type_ids']:
profile.set_shape(name, min=(1, seq_len), opt=(batch_size, seq_len), max=(batch_size * 2, seq_len))
        config.add_optimization_profile(profile)
        # With dynamic shapes, the calibrator also needs a profile to run against
        config.set_calibration_profile(profile)
print("Building TensorRT engine... (This may take a while)")
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
print("Failed to build engine.")
return None
with open(engine_path, 'wb') as f:
f.write(serialized_engine)
print(f"Engine saved to {engine_path}")
return serialized_engine
# Assuming you have an exported 'bert.onnx' model
# build_engine('bert.onnx', 'bert_int8.engine')
This script executes the entire PTQ pipeline. The config.set_flag(trt.BuilderFlag.INT8) and the assignment of our custom calibrator are the key lines that enable INT8 quantization.
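The build step assumes an exported bert.onnx already exists. Below is a minimal export sketch, assuming a plain BertModel with a dynamic batch axis, a fixed sequence length, and opset 13; for the SST-2 accuracy numbers reported later you would export a classification head (e.g. BertForSequenceClassification) instead:

import torch
from transformers import BertModel

model = BertModel.from_pretrained(MODEL_NAME).eval()
dummy = (
    torch.ones(1, MAX_SEQUENCE_LENGTH, dtype=torch.long),    # input_ids
    torch.ones(1, MAX_SEQUENCE_LENGTH, dtype=torch.long),    # attention_mask
    torch.zeros(1, MAX_SEQUENCE_LENGTH, dtype=torch.long),   # token_type_ids
)
torch.onnx.export(
    model,
    dummy,
    'bert.onnx',
    input_names=['input_ids', 'attention_mask', 'token_type_ids'],
    output_names=['last_hidden_state', 'pooler_output'],
    dynamic_axes={name: {0: 'batch'} for name in ['input_ids', 'attention_mask', 'token_type_ids']},
    opset_version=13,
)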
Advanced Topics and Edge Case Handling
Building the engine is only half the battle. Production deployment requires a deeper understanding of potential pitfalls and how to mitigate them.
1. Mixed Precision: When Not to Quantize
Not all layers are created equal. Some operations are highly sensitive to the precision reduction of quantization. The softmax operation in attention mechanisms is a classic example. Its output distribution is often very narrow (one value close to 1.0, others close to 0.0), which quantizes poorly. Quantizing such layers can lead to a significant drop in accuracy.
Solution: Layer-wise Precision Control
TensorRT's builder does not force every layer into INT8. When the INT8 flag is set, it still times candidate kernel implementations for each layer across the allowed precisions and keeps a layer in FP16/FP32 wherever no INT8 implementation is available or INT8 is not actually faster. The result is a mixed-precision engine; if accuracy still suffers, you can restrict precision per layer manually, as shown further below.
To see this in action, you can analyze the built engine:
# After building, deserialize the engine so it can be inspected
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(serialized_engine)
inspector = engine.create_engine_inspector()
layer_info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
print(layer_info)
Inspecting the JSON output will reveal the precision of each layer. You will often find SoftMax and LayerNorm layers have been kept in Float16/Float32 while the large MatMul/GEMM layers are converted to Int8.
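Because the exact JSON schema of the inspector output varies across TensorRT versions, a defensive way to summarize it is to count datatype strings per layer entry rather than rely on specific field names. A rough sketch:

import json
from collections import Counter

info = json.loads(layer_info)
precision_counts = Counter()
for layer in info.get('Layers', []):
    # Entries may be dicts or strings depending on the TensorRT version
    text = layer if isinstance(layer, str) else json.dumps(layer)
    for dtype in ('Int8', 'Half', 'Float'):
        if dtype in text:
            precision_counts[dtype] += 1
            break   # count each layer once, by the first datatype string found
print(precision_counts)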
You can also exert manual control. If you've profiled your model and know a specific layer (e.g., MySensitiveLayer_123) must remain in FP32, you can force it:
# Inside build_engine, after parsing
# TensorRT 8.x only honors explicit per-layer precisions when this flag is set
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.name == 'MySensitiveLayer_123':
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)
This surgical approach is critical for balancing performance gains with accuracy preservation.
2. Performance Benchmarking and Validation
You must rigorously benchmark your quantized engine against the FP16 baseline.
Metrics to Track:
- Latency: p50/p95/p99 per request.
- Throughput: requests (or sequences) per second at your production batch size.
- VRAM usage: check nvidia-smi before and after loading the engine.
- Task accuracy: re-evaluate on a held-out validation set (SST-2 in this example).
Example Benchmark Snippet:
import time
import numpy as np
# Assume 'engine_fp16' and 'engine_int8' are loaded TensorRT engines
# Assume 'context', 'inputs', 'outputs', 'bindings', 'stream' are set up
def run_inference(context, bindings, stream):
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
stream.synchronize()
def benchmark(engine):
# ... setup context, bindings, etc. ...
# Warmup runs
for _ in range(20):
run_inference(context, bindings, stream)
# Timed runs
timings = []
for _ in range(100):
start_time = time.perf_counter()
run_inference(context, bindings, stream)
end_time = time.perf_counter()
timings.append((end_time - start_time) * 1000)
print(f"Average latency: {np.mean(timings):.2f} ms")
print(f"P95 latency: {np.percentile(timings, 95):.2f} ms")
print(f"P99 latency: {np.percentile(timings, 99):.2f} ms")
A typical result might look like this:
| Precision | Latency (p99, ms) | Throughput (req/s) | VRAM (GB) | Accuracy (SST-2) |
|---|---|---|---|---|
| FP16 | 15.2 | 65 | 1.5 | 92.8% |
| INT8 PTQ | 6.8 | 147 | 0.8 | 92.1% |
Here, a 0.7% accuracy drop is traded for a >2x improvement in throughput and ~50% VRAM reduction. This is often a highly favorable trade-off in production.
3. Production Pattern: Continuous Calibration
Models in production suffer from data drift. The statistical properties of the live data can slowly change over time, diverging from the original calibration dataset. This can invalidate the calculated scaling factors and degrade the INT8 model's performance.
Solution: A Continuous Calibration Pipeline
The idea is to periodically sample a representative slice of recent production traffic, regenerate the calibration dataset from it, invalidate the old calibration cache, rebuild the INT8 engine, and promote the new engine only after it passes automated accuracy and latency gates against the current one. This MLOps pattern is essential for maintaining the long-term health and performance of quantized models in dynamic environments.
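A sketch of what such a recalibration job might look like, reusing the helpers defined earlier in this article; the traffic-sampling and promotion steps are hypothetical placeholders for your own serving infrastructure:

import os

ONNX_PATH = 'bert.onnx'
CANDIDATE_ENGINE = 'bert_int8_candidate.engine'
CACHE_FILE = './bert_int8_calibration.cache'

def recalibrate_and_rebuild():
    # 1. Invalidate the stale cache so TensorRT recomputes the scale factors
    #    from fresh calibration data. (In production, point the data-loading
    #    step at a recent, representative sample of live traffic.)
    if os.path.exists(CACHE_FILE):
        os.remove(CACHE_FILE)

    # 2. Rebuild the INT8 engine; build_engine() regenerates the calibration
    #    batches internally and runs entropy calibration again.
    if build_engine(ONNX_PATH, CANDIDATE_ENGINE) is None:
        raise RuntimeError('Candidate engine build failed')

    # 3. Gate on accuracy and latency before promoting the candidate engine.
    #    (evaluate_accuracy / promote are hypothetical hooks into your stack.)
    # if evaluate_accuracy(CANDIDATE_ENGINE) >= ACCURACY_FLOOR:
    #     promote(CANDIDATE_ENGINE)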
Conclusion
Post-Training Quantization is a powerful, indispensable technique for optimizing LLM inference, but it is far from a simple switch. Success hinges on a deep understanding of the calibration process, a commitment to rigorous benchmarking, and the strategic use of advanced features like mixed-precision. By moving beyond basic min-max calibration to entropy-based methods and implementing robust MLOps practices like continuous calibration, senior engineers can unlock substantial performance gains, reduce operational costs, and deploy state-of-the-art models at a scale that would be otherwise prohibitive.