New in March 2024

TL;DR

After announcing our Series B at the beginning of March, we’ve focused this month on model performance and developer experience. From FP8 to TensorRT-LLM to Multi-Instance GPUs, there are now more options than ever for serving your models on Baseten with the highest possible throughput and tokens per second at the lowest possible latency and cost. And you can now automate key model management tasks with our new REST API endpoints.

Benchmarking fast Mistral inference

A couple of weeks ago, we announced best-in-class performance on latency and throughput metrics for Mistral 7B, as measured by independent researchers at Artificial Analysis.

Total response time for Mistral 7B as measured by Artificial Analysis on March 27, 2024

For a more detailed look at the benchmarks we use for model performance, read our guide to benchmarking Mistral 7B, which covers performance metrics across a wide range of batch sizes and sequence shapes, along with notes on essential benchmark configuration decisions like session reuse and tokenizer selection.
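
As a minimal illustration of the kind of measurement behind those benchmarks, here’s a sketch of timing a single request with curl. The model endpoint URL, payload fields, and max_tokens value are placeholders for your own deployment — the full benchmarks in the guide additionally control for batch size, sequence shape, session reuse, and tokenizer choice.

# Hypothetical Mistral 7B deployment on Baseten; substitute your own model ID,
# API key, and prompt. --write-out prints the total wall-clock time of the request;
# dividing the number of generated tokens by this time gives a rough tokens-per-second figure.
curl --silent --output /dev/null \
     --request POST \
     --url https://model-<model-id>.api.baseten.co/production/predict \
     --header 'Authorization: Api-Key <api-key>' \
     --data '{"prompt": "Summarize the history of the GPU.", "max_tokens": 512}' \
     --write-out 'Total response time: %{time_total}s\n'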

Model performance is a rapidly evolving space; the landscape has shifted even in the two weeks since we published our initial results. But our work on Mistral 7B is only the beginning: we’re actively researching new techniques for even faster inference and bringing our existing optimizations to a wider range of models.

Save 20% vs A100 with H100 MIG

Multi-Instance GPU (MIG) is a feature of recent top-of-the-line GPUs, including the H100, that allows a single GPU to be partitioned into independent fractional GPUs, each of which can serve models on its own.

Using MIG on H100 GPUs, we’ve created a new H100 MIG 3g.40gb instance type with 3/7 the compute and half the memory of a full H100. These new instances are available in your Baseten workspace.

These fractional H100 GPUs match or exceed the performance of A100 GPUs for many model inference workloads when using TensorRT or TensorRT-LLM. And with a 20% lower list price than A100-backed instances, H100 MIG represents substantial cost savings on high-performance model deployments.
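
On Baseten, MIG-backed instances are provisioned for you, but for a sense of what happens under the hood, here’s a rough sketch of how a 3g.40gb slice is carved out of an H100 with NVIDIA’s own tooling. Exact profile names, IDs, and flags vary by driver version, so treat this as illustrative rather than a recipe.

# Enable MIG mode on GPU 0 (requires a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the card supports, including 3g.40gb
nvidia-smi mig -lgip

# Create a 3g.40gb GPU instance and a matching compute instance on it
sudo nvidia-smi mig -cgi 3g.40gb -C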

Efficient model inference with FP8 quantization

FP8 is a newly supported data format in NVIDIA’s Ada Lovelace and Hopper architectures. It offers advantages over INT8 for model inference thanks to its higher dynamic range.

Visualizing FP32, FP16, FP8, and INT8 precisions

We used FP8 to optimize Mistral 7B inference and found that it yielded a 33% increase in tokens per second when paired with TensorRT-LLM on an H100 GPU. Unlike our experiments with INT8, FP8 showed a near-zero increase in perplexity, meaning that model output quality was effectively unaffected by quantization.
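
If you want to experiment with FP8 yourself, the TensorRT-LLM repository includes a quantization workflow along the lines of the sketch below. Script paths, flags, and checkpoint formats differ between TensorRT-LLM releases, so this is an outline of the steps rather than an exact command sequence.

# Quantize a Hugging Face Mistral 7B checkpoint to FP8 (the script lives in the
# TensorRT-LLM repo; exact path and flags depend on the release you're using)
python examples/quantization/quantize.py \
    --model_dir ./Mistral-7B-Instruct-v0.2 \
    --qformat fp8 \
    --output_dir ./mistral-7b-fp8-checkpoint

# Build a serving engine from the quantized checkpoint
trtllm-build \
    --checkpoint_dir ./mistral-7b-fp8-checkpoint \
    --output_dir ./mistral-7b-fp8-engine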

If you’re interested in serving a model using FP8 and TensorRT-LLM on robust autoscaling infrastructure, we’re here to help at support@baseten.co.

Baseten API for model and workspace management

Use our new REST API endpoints to programmatically manage models and workspace properties such as secrets.

Here’s an example endpoint for listing all deployed models in a workspace:

curl --request GET \
     --url https://api.baseten.co/v1/models \
     --header 'Authorization: Api-Key <api-key>'
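
If you have jq installed, you can pull out just the fields you need. For example, assuming the response nests results under a models key (check the API docs for the exact response shape):

# List only the names of deployed models; the .models[].name path is an assumption
# about the response shape — consult the management API docs for the actual schema.
curl --silent --request GET \
     --url https://api.baseten.co/v1/models \
     --header 'Authorization: Api-Key <api-key>' | jq '.models[].name'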

For an up-to-date reference on endpoints and return types, see the management API docs. We’ve already published over a dozen endpoints, with more on the way. If there’s anything you’d like to be able to automate in your workspace with the new API, let us know at support@baseten.co.

We’ll be back next month with more from the world of open source ML!

Thanks for reading,

— The team at Baseten