High performance ML inference with NVIDIA TensorRT

At Baseten, we’ve used NVIDIA TensorRT and TensorRT-LLM to achieve exceptional performance on ML model inference. We’ve seen 40% lower latency on SDXL, sub-200 ms time to first token on Mixtral 8x7B, and 3x higher 7B LLM throughput on H100 GPUs. These early results show the power of TensorRT, but also raise questions — what is TensorRT, what performance benefits can it offer, and how can you leverage it to serve your own models?

“TensorRT-LLM was a breakthrough for Bland. Working with Baseten to optimize all of our GPU processes, we were able to hit our incredibly ambitious latency target for time to first token along with much higher throughput. Our users care deeply about speed, and we’re able to meet their needs because of NVIDIA and Baseten.” — Isaiah Granet, CEO, Bland AI

In this guide, we’ll provide a detailed overview of TensorRT, covering:

  • The role TensorRT and TensorRT-LLM play in the ML model inference stack.

  • Using TensorRT to serve models in production.

  • Performance benchmarks for popular open source models optimized with TensorRT.

We’re excited to share the best practices for model optimization that we’ve learned from working closely with NVIDIA engineers from the TensorRT team.

Introduction to NVIDIA TensorRT

NVIDIA TensorRT is a software development kit for high-performance deep learning inference. Alongside TensorRT, NVIDIA TensorRT-LLM is a Python API for using TensorRT to serve large language models. At Baseten, we use both TensorRT and TensorRT-LLM in production to optimize model performance.

TensorRT takes model weights as input and returns a servable model engine as output. The TensorRT optimization process is run after training and fine-tuning but before inference.

âś•
TensorRT works at the model optimization level and enables performant, continuously batched model serving

TensorRT works by taking a model description, such as an ONNX file, and compiling the model to run more efficiently on a given GPU. The resulting engine can then be served with lower latency and higher throughput through TensorRT's C++ and Python runtimes. TensorRT achieves best-in-class performance gains by making optimizations at the CUDA level on compiled models rather than serving raw weights directly.
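As a rough illustration, here's a minimal sketch of that compilation step using TensorRT's Python API. It assumes a TensorRT 8.x-era install and a hypothetical model.onnx export; the builder flags you'd actually set depend on your model and hardware.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model description into a TensorRT network definition.
with open("model.onnx", "rb") as f:  # hypothetical ONNX export of your model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Compile the network into an engine optimized for the GPU this code runs on.
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)

# Save the engine so it can be loaded by the C++ or Python runtime at serving time.
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```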

Requirements for using TensorRT in production

Optimization requires specialization: modifying something with strong general performance to perform even better at a specific task. Optimizing a model for production with TensorRT requires up-front knowledge about your compute needs and traffic patterns. You need to know what you’re optimizing for across:

  • GPU: TensorRT compiles models to take advantage of specific hardware and architectural features of a given GPU.

  • Batch size: Batching increases throughput and GPU utilization, giving you more inference for your money, but must be balanced with latency requirements.

  • Precision: TensorRT comes with various quantization algorithms out of the box, which can enable faster, less expensive model serving.

  • Input and output shapes: Providing approximate input and output shapes (e.g. sequence lengths for LLMs) that mimic actual usage enables further optimization.

TensorRT compiles your model based on the information you provide for these four factors. The compiled model is not portable: if any one of these factors changes, you'll need to compile a new optimized model.
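To make this concrete, here's a hedged sketch of how three of these factors (batch size, precision, and input shapes) are expressed when building an engine with TensorRT's Python API. The GPU is implicit, since the engine is compiled for the device running the build. The tensor name "input_ids" and the shape ranges are illustrative assumptions, not values from our benchmarks.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
# ... parse your model into `network` as in the previous sketch ...

config = builder.create_builder_config()

# Precision: allow FP16 kernels where they outperform FP32.
config.set_flag(trt.BuilderFlag.FP16)

# Batch size and input shapes: an optimization profile tells TensorRT the
# minimum, typical, and maximum shapes to tune kernels for.
profile = builder.create_optimization_profile()
profile.set_shape(
    "input_ids",      # hypothetical input tensor name
    min=(1, 1),       # smallest batch size and sequence length you expect
    opt=(8, 512),     # the shape most of your traffic should hit
    max=(32, 2048),   # the largest shape the engine must handle
)
config.add_optimization_profile(profile)

# GPU: the engine is built for (and only portable to) the device running this code.
engine = builder.build_serialized_network(network, config)
```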

Supported models and hardware

Using TensorRT and TensorRT-LLM in production requires a supported model and a supported GPU.

TensorRT-LLM supports a wide range of large language model families including Mistral, Llama, Qwen, and many others, plus models in other modalities like Whisper and LLaVA. TensorRT itself supports even more models, including Stable Diffusion XL and models with similar architectures.

TensorRT and TensorRT-LLM support NVIDIA’s more recent GPU architectures, including Volta, Turing, Ampere, Ada Lovelace, and Hopper. We’ve found that TensorRT optimizations provide larger performance gains on bigger, more recent GPUs, such as the A100 and H100.

Model performance benchmarks with TensorRT

What level of performance gains do TensorRT and TensorRT-LLM offer? It depends on the model, use case, and GPU. In general, more powerful GPUs, higher traffic, and larger sequence lengths lead to greater performance gains: the more load on the system, the more there is for TensorRT to optimize.

Below, we’ll share benchmarks for one language model (Mixtral 8x7B) and one image model (SDXL) as examples of the performance gains that are possible with TensorRT.

Benchmarks for Mixtral 8x7B with TensorRT-LLM

We benchmarked Mixtral 8x7B with TensorRT-LLM against a baseline implementation on A100 GPUs. TensorRT-LLM's gains grow with batch size, making it especially useful for cost efficiency under strict latency requirements: overall throughput improves while time to first token and perceived tokens per second remain excellent.

Running in float16 with 512 tokens of input and 128 tokens of output, Mixtral 8x7B saw 40% lower latency (time to first token) and 60% higher throughput (total tokens per second) on more realistic, higher-concurrency workloads.
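For reference, time to first token and total tokens per second can be measured from a streaming client roughly as follows. This is a generic sketch, not the harness we used for these benchmarks.

```python
import time

def measure_stream(token_stream):
    """Compute time to first token and tokens/sec from an iterator of streamed tokens.

    `token_stream` can be any iterable that yields tokens (or text chunks) as an
    inference server produces them.
    """
    start = time.perf_counter()
    first_token_at = None
    num_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        num_tokens += 1
    elapsed = time.perf_counter() - start
    time_to_first_token = first_token_at - start
    tokens_per_second = num_tokens / elapsed
    return time_to_first_token, tokens_per_second
```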

âś•
Time to first token across different batch sizes (lower is better)
âś•
Total tokens per second generated by Mixtral (higher is better)

Benchmarks for SDXL with TensorRT

We benchmarked SDXL with TensorRT against a baseline implementation on A10G, A100, and H100 GPUs. On larger, more powerful GPUs, TensorRT offers even higher performance gains as it’s able to take full advantage of the GPU’s hardware and features. On an H100 GPU, serving SDXL with TensorRT improves latency by 40% and throughput by 70%.

âś•
Inference time at different step counts for SDXL on an A100 GPU (lower is better).
âś•
Throughput at different step counts for SDXL on an A100 GPU (higher is better).

Baseten’s collaboration with NVIDIA on TensorRT

TensorRT and TensorRT-LLM are powerful tools for accelerating model inference, but require specialized technical expertise and a clear understanding of your compute needs and traffic patterns to operate in production.

We’ve worked closely with NVIDIA’s technical specialists to understand and productionize best practices for using TensorRT to serve ML models in production. We’ve written in depth about our process and results using TensorRT to optimize inference for Mistral 7B, Mixtral 8x7B, and SDXL, showcasing how TensorRT on top-of-the-line GPUs leads to world-class performance on latency- and throughput-sensitive tasks.

Additionally, we’ve productized our model serving work by supporting TensorRT and TensorRT-LLM in Truss, our open source model packaging framework. Out of the box, you get access to all of TensorRT’s model serving features, such as in-flight batching (also known as continuous batching). Get started with production-ready open source implementations of popular models with TensorRT and Truss, including Mistral, Llama, Gemma, and many more examples.

Deploy your models on Baseten to leverage TensorRT in production, reducing latency and increasing throughput on high-traffic workloads. Your models will run securely on our autoscaling infrastructure with scale to zero and fast cold starts. To use TensorRT in production:

  1. Choose a TensorRT-optimized model like Mixtral 8x7B from our model library.

  2. Deploy the model on an autoscaling instance with a powerful GPU in just one click.

  3. Call your new model endpoint for high-performance inference.
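For example, step 3 might look like the sketch below. It assumes Baseten's standard predict endpoint format, an API key in the BASETEN_API_KEY environment variable, and a placeholder model ID; adjust the request body to match the model you deployed.

```python
import os
import requests

MODEL_ID = "YOUR_MODEL_ID"  # placeholder; copy the real ID from your Baseten dashboard

# Assumes the standard Baseten predict endpoint and API-key auth header.
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "What is NVIDIA TensorRT?", "max_tokens": 256},
)
resp.raise_for_status()
print(resp.json())
```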