Feb 29, 2024

New in February 2024

Prompt: A futuristic submarine in a colorful coral reef

TL;DR

Latency, throughput, quality, cost. These factors determine the success of an ML model in production, and we’re excited to share February’s improvements across all four factors in this newsletter. From lower latency and higher throughput with TensorRT on H100 GPUs to sub-one-second image generation with SDXL Lightning, we’ve continued our focus on model performance. New open source models bring multimodal capabilities and best-in-class quality, and our refreshed billing dashboard gives you daily insight into usage and spend.

NVIDIA H100 GPUs for model inference

We’re now offering model inference on H100 GPUs — the world’s most powerful GPU for running ML models.

H100 GPUs feature:

989.5 teraFLOPS of FP16 tensor compute (vs. 312 teraFLOPS for the 80 GB A100)
80 GB of VRAM (matching the 80 GB A100)
3.35 TB/s of memory bandwidth (vs. 2.039 TB/s for the 80 GB A100)
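
To put the comparison above in relative terms, here is a small sketch that divides each H100 figure by its A100 counterpart. The dictionary keys and the `spec_ratios` helper are our own illustrative names; the numbers are the ones quoted in the list.

```python
# Spec-sheet figures quoted in the list above (dictionary keys are illustrative).
H100 = {"fp16_tflops": 989.5, "vram_gb": 80, "mem_bw_tbps": 3.35}
A100_80GB = {"fp16_tflops": 312.0, "vram_gb": 80, "mem_bw_tbps": 2.039}


def spec_ratios(new, old):
    """Return new/old for every spec that both GPUs list."""
    return {key: new[key] / old[key] for key in new if key in old}


ratios = spec_ratios(H100, A100_80GB)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# fp16_tflops: 3.17x
# vram_gb: 1.00x
# mem_bw_tbps: 1.64x
```

Raw compute roughly triples while memory bandwidth grows by about two thirds, so how much of the gain a given model sees will likely depend on whether it is compute-bound or bandwidth-bound.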

This translates to extraordinary performance for model inference, especially for models optimized with TensorRT-LLM. In our testing, we saw 3x higher throughput at constant latency for Mistral 7B versus A100 GPUs. This results in a 45% reduction in cost for running high-traffic workloads.
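
The step from 3x throughput to a 45% cost reduction depends on the relative price of the two GPUs. The sketch below assumes cost per request scales as hourly price divided by throughput, which is a simplification; the function names are hypothetical, and the price ratio it prints is only what the two quoted figures imply together, not a published price.

```python
def cost_change(throughput_ratio, price_ratio):
    """Fractional change in cost per request on the new hardware.

    throughput_ratio: new requests/sec divided by old requests/sec.
    price_ratio: new hourly price divided by old hourly price.
    Assumes cost per request = hourly price / throughput (a simplification).
    """
    return price_ratio / throughput_ratio - 1.0


def implied_price_ratio(throughput_ratio, cost_reduction):
    """Price ratio at which the given throughput gain yields the given saving."""
    return throughput_ratio * (1.0 - cost_reduction)


# Figures from the paragraph above: 3x throughput, 45% lower cost.
implied = implied_price_ratio(3.0, 0.45)
print(f"Implied H100/A100 price ratio: {implied:.2f}")        # 1.65
print(f"Cost change at that ratio: {cost_change(3.0, implied):.0%}")  # -45%
```

At equal hourly prices, 3x throughput would cut cost per request by about 67%, so the 45% figure is consistent with an H100 that costs more per hour than an A100 but still comes out ahead at high traffic.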

See our H100 changelog for details on pricing and instance types. H100 GPUs are available for all users; you can deploy a model on an H100 GPU today.

40% faster SDXL with TensorRT

TensorRT, a software development kit for high-performance deep learning inference by NVIDIA, is a powerful tool for making models run faster, especially on top-end GPUs like the A100 and H100.

While TensorRT is often used via TensorRT-LLM to optimize language models, you can also use the base TensorRT to optimize a wider range of models. We optimized Stable Diffusion XL with TensorRT and saw 40% lower latency and 70% higher throughput on H100 GPUs compared to a baseline implementation.

Percentage improvement in latency and throughput from using TensorRT (higher is better).
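
The two percentages are closely related. As a rough check, assuming requests are served one at a time so that throughput is the inverse of latency (our simplification, not necessarily how the benchmark was run), a 40% latency cut works out to roughly the throughput gain we measured. The helper name is hypothetical.

```python
def throughput_gain_from_latency_cut(latency_reduction):
    """Throughput gain implied by a latency reduction, if throughput = 1 / latency.

    latency_reduction: fraction in [0, 1), e.g. 0.40 for 40% lower latency.
    """
    if not 0.0 <= latency_reduction < 1.0:
        raise ValueError("latency_reduction must be in [0, 1)")
    return 1.0 / (1.0 - latency_reduction) - 1.0


gain = throughput_gain_from_latency_cut(0.40)
print(f"Implied throughput gain: {gain:.0%}")  # 67%
```

Under that assumption the implied gain is about 67%, in line with the roughly 70% higher throughput reported above.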

Deploy TensorRT-optimized models from our model library to leverage these performance gains in your product.

Real-time image generation with SDXL Lightning

If SDXL on an H100 isn’t fast enough for you, consider SDXL Lightning, a new implementation of few-step image generation. SDXL Lightning shows notable improvements over other fast image models like SDXL Turbo, including full 1024x1024 output image size and closer prompt adherence. However, there is still a compromise in quality versus the base SDXL model, especially for highly detailed images.

Deploy SDXL Lightning in one click from our model library and start generating images in less than 1 second per image.

Prompt: A rhino wearing a suit

Qwen VL: an open source visual language model

Alibaba has released Qwen, a family of open source language models somewhat like Llama 2. Qwen is short for Tongyi Qianwen (通义千问), which we translate as “Responding to any and all of your questions, no matter the subject or the quantity.”

Within the Qwen family of models, Qwen VL is unique as a large vision language model. Qwen VL can describe images in natural language, with grounding that identifies where in the image each described object lies.

Deploy Qwen VL for a peek into the future of multimodal models that combine vision and language.

Best-in-class open source text embedding

Nomic Embed v1.5 is a text embedding model that beats OpenAI’s text-embedding-3-small on benchmarks while using only half the dimensionality. Nomic Embed v1.5 offers:

Optimized embeddings for retrieval, search, clustering, or classification.
Adjustable dimensionality with Matryoshka Representation Learning.
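
To illustrate what adjustable dimensionality means in practice: with Matryoshka-style embeddings, a shorter vector is obtained by keeping the leading dimensions and rescaling to unit length. The sketch below shows only that general idea on a toy vector; `truncate_embedding` is a hypothetical helper, and the model's own documentation may prescribe additional normalization steps that are not shown here.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` values of an embedding and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
small = truncate_embedding(full, 4)
print(len(small))  # 4
```

Shorter vectors cost less to store and search, at some expense in retrieval quality, so the right size is a trade-off to measure on your own data.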

Deploy Nomic Embed v1.5 from the Baseten model library for accurate, efficient text embedding.

Product update: improved billing visibility

In February, we released a refreshed billing dashboard that gives you detailed insights into your model usage and associated spend. Here’s what we added:

A new graph for daily costs, requests, and billable minutes.
Billing and usage information for the previous billing period.
Request count visibility within the model usage table.

Baseten billing dashboard with daily model usage graph

We’ll be back next month with more from the world of open source ML!

Thanks for reading,

— The team at Baseten
