New in February 2024

TL;DR

Latency, throughput, quality, cost. These factors determine the success of an ML model in production, and we're excited to share February's improvements across all four factors in this newsletter. From lower latency and higher throughput with TensorRT on H100 GPUs to sub-one-second image generation with SDXL Lightning, we've continued our focus on model performance. New open source models bring multimodal capabilities and best-in-class quality, and our refreshed billing dashboard gives you daily insight into usage and spend.

NVIDIA H100 GPUs for model inference

We're now offering model inference on H100 GPUs, the world's most powerful GPU for running ML models.

H100 GPUs feature:

  • 989.5 teraFLOPS of fp16 tensor compute (vs 312 teraFLOPS for the 80GB A100)

  • 80 GB of VRAM (matching the 80GB A100)

  • 3.35 TB/s memory bandwidth (vs 2.04 TB/s for the 80GB A100)

This translates to extraordinary performance for model inference, especially for models optimized with TensorRT-LLM. In our testing, we saw 3x higher throughput at constant latency for Mistral 7B versus A100 GPUs. This results in a 45% reduction in cost for running high-traffic workloads.
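As a back-of-envelope check on how 3x throughput becomes a 45% cost reduction: if an H100 instance costs roughly 1.65x as much per hour as an A100 instance (an illustrative ratio for this sketch, not actual pricing; see the pricing page), then tripling requests per second at constant latency cuts the cost per request by 45%.

```python
# Back-of-envelope: how 3x throughput can yield ~45% lower cost per request.
# The 1.65x price ratio is an illustrative assumption, not actual pricing.
throughput_gain = 3.0  # requests/sec on H100 vs A100, at constant latency
price_ratio = 1.65     # assumed H100 hourly price / A100 hourly price

# Cost per request scales with price and inversely with throughput.
cost_per_request_ratio = price_ratio / throughput_gain
reduction = 1 - cost_per_request_ratio
print(f"Cost per request drops by {reduction:.0%}")
```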

See our H100 changelog for details on pricing and instance types. H100 GPUs are available for all users; you can deploy a model on an H100 GPU today.

40% faster SDXL with TensorRT

TensorRT, NVIDIA's software development kit for high-performance deep learning inference, is a powerful tool for making models run faster, especially on top-end GPUs like the A100 and H100.

While TensorRT is often used via TensorRT-LLM to optimize language models, you can also use the base TensorRT to optimize a wider range of models. We optimized Stable Diffusion XL with TensorRT and saw 40% lower latency and 70% higher throughput on H100 GPUs compared to a baseline implementation.

Percentage improvement in latency and throughput from using TensorRT (higher is better).

Deploy TensorRT-optimized models from our model library to leverage these performance gains in your product.

Real-time image generation with SDXL Lightning

If SDXL on an H100 isn't fast enough for you, consider SDXL Lightning, a new implementation of few-step image generation. SDXL Lightning shows notable improvements over other fast image models like SDXL Turbo, including full 1024x1024 output image size and closer prompt adherence. However, there is still a quality tradeoff versus the base SDXL model, especially for highly detailed images.

Deploy SDXL Lightning in one click from our model library and start generating images in less than 1 second per image.
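Once deployed, the model is served behind an HTTP predict endpoint. Here is a minimal sketch of calling it from Python; the model ID, input schema, and response field name below are illustrative assumptions, so check your deployed model's documentation for the exact API.

```python
# Sketch of calling a deployed SDXL Lightning model over HTTP.
# Model ID, input schema, and response field are assumptions, not a spec.
import json
import urllib.request

def build_request(model_id: str, api_key: str, prompt: str):
    """Assemble the URL, headers, and JSON body for a predict call."""
    url = f"https://model-{model_id}.api.baseten.co/production/predict"
    headers = {
        "Authorization": f"Api-Key {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"prompt": prompt}  # assumed input schema
    return url, headers, payload

url, headers, payload = build_request(
    "abc123", "YOUR_API_KEY", "A rhino wearing a suit"
)
req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers=headers)
# with urllib.request.urlopen(req) as resp:   # uncomment with a real model ID
#     image_b64 = json.load(resp)["data"]     # assumed response field
```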

Prompt: A rhino wearing a suit

QwenVL: an open source visual language model

Alibaba has released Qwen, a family of open source language models broadly comparable to Llama 2. Qwen is short for Tongyi Qianwen (通义千问), which we translated to "responding to any and all of your questions, no matter the subject or the quantity."

Within the Qwen family of models, Qwen VL is unique as a large vision language model. Qwen VL can describe images in natural language and, with grounding, identify where in the image each described object lies.

Deploy Qwen VL for a peek into the future of multimodal models that combine vision and language.
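Grounded output interleaves text with reference and bounding-box tags. As a hypothetical illustration of working with that output, here is a small parser; the exact tag format and the 0-1000 coordinate grid are assumptions based on our reading of the model card, so verify against the model's documentation.

```python
# Hypothetical parser for Qwen VL style grounded output.
# Tag format and 0-1000 coordinate normalization are assumptions.
import re

BOX_RE = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_grounding(text: str, width: int, height: int):
    """Convert grounded spans into pixel-space bounding boxes."""
    results = []
    for label, x1, y1, x2, y2 in BOX_RE.findall(text):
        results.append({
            "label": label,
            # scale from the assumed 0-1000 grid to pixel coordinates
            "box": (
                int(x1) * width // 1000, int(y1) * height // 1000,
                int(x2) * width // 1000, int(y2) * height // 1000,
            ),
        })
    return results

sample = "<ref>a rhino</ref><box>(100,200),(600,900)</box>"
boxes = parse_grounding(sample, width=1024, height=1024)
```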

Best in class open source text embedding

Nomic Embed v1.5 is a text embedding model that beats OpenAI's text-embedding-3-small on benchmarks while using only half the dimensionality. Nomic Embed v1.5 offers:

  • Optimized embeddings for retrieval, search, clustering, or classification.

  • Adjustable dimensionality with Matryoshka Representation Learning.

Deploy Nomic Embed v1.5 from the Baseten model library for accurate, efficient text embedding.
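Matryoshka Representation Learning trains the model so that the leading dimensions carry the most information, which means you can shrink an embedding simply by truncating it and re-normalizing. A minimal sketch, assuming a 768-dimensional full embedding (the sample vector below is synthetic, for illustration only):

```python
# Sketch of Matryoshka-style dimensionality reduction: truncate, then
# re-normalize so downstream cosine similarity still behaves.
import math

def shrink_embedding(vec, dim):
    """Keep the first `dim` dimensions and rescale to unit length."""
    truncated = vec[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

full = [math.sin(i) for i in range(768)]  # stand-in for a real embedding
small = shrink_embedding(full, 256)       # 3x smaller, unit-length vector
```

This trade-off lets you tune index size and retrieval cost per use case without re-embedding your corpus at a different model size.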

Product update: improved billing visibility

In February, we released a refreshed billing dashboard that gives you detailed insights into your model usage and associated spend. Here's what we added:

  • A new graph for daily costs, requests, and billable minutes.

  • Billing and usage information for the previous billing period.

  • Request count visibility within the model usage table.

Baseten billing dashboard with daily model usage graph

We'll be back next month with more from the world of open source ML!

Thanks for reading,

— The team at Baseten