Baseten Blog

Model performance

33% faster LLM inference with FP8 quantization

Quantizing Mistral 7B to FP8 resulted in a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
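As a rough illustration of what per-tensor FP8 (E4M3) quantization does to a weight tensor, here is a minimal NumPy sketch. It simulates quantize-then-dequantize rounding error only; it is not Baseten's pipeline, which runs native FP8 kernels via frameworks like TensorRT-LLM, and it ignores FP8 subnormals and NaN encodings.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value in FP8 E4M3
MANTISSA_BITS = 3  # E4M3 keeps 3 explicit mantissa bits

def fake_quant_fp8(w: np.ndarray) -> np.ndarray:
    """Simulate per-tensor FP8 (E4M3) quantize -> dequantize in float32.

    Illustration only: skips subnormal and NaN handling.
    """
    scale = np.abs(w).max() / E4M3_MAX   # per-tensor scale factor
    x = w / scale                        # map into the E4M3 range
    m, e = np.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 2 ** (MANTISSA_BITS + 1)) / 2 ** (MANTISSA_BITS + 1)
    x = np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)
    return (x * scale).astype(np.float32)  # dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
print("mean abs rounding error:", np.abs(w - fake_quant_fp8(w)).mean())
```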

Model performance

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
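TensorRT gets these numbers by compiling a trained model, typically exported to ONNX, into an engine optimized for a specific GPU. A minimal sketch of the standard Python build workflow is below; the `model.onnx` path is a placeholder, and this is the generic TensorRT API rather than the exact pipeline behind the post.

```python
import tensorrt as trt  # requires an NVIDIA GPU and the TensorRT Python package

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse an ONNX export of the model ("model.onnx" is a placeholder path).
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels where profitable

# Compile the network into a serialized engine tuned for the current GPU.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```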

Glossary

FP8: Efficient model inference with 8-bit floating point numbers

The FP8 data format has a wider dynamic range than INT8, which allows the weights and activations of more LLMs to be quantized without loss of output quality.
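Concretely, INT8 represents only the integers −128 through 127 behind a scale factor, while FP8 E4M3 spreads its 256 encodings over roughly ±448 with floating-point spacing, and the E5M2 variant reaches about ±57,344. The extremes fall straight out of the bit layouts:

```python
# Largest finite values of 8-bit formats, derived from their bit layouts.
int8_max = 2**7 - 1                    # 127
e4m3_max = 2**(15 - 7) * (1 + 6 / 8)   # 448.0 (exp 1111, mantissa 110; 111 is NaN)
e5m2_max = 2**(30 - 15) * (1 + 3 / 4)  # 57344.0 (exp 11111 reserved for inf/NaN)
print(int8_max, e4m3_max, e5m2_max)
```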

News

Announcing our Series B

We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.

Glossary

The benefits of globally distributed infrastructure for model serving

Multi-cloud and multi-region infrastructure for model serving provides higher availability, redundancy, lower latency, cost savings, and data residency compliance.

Product

New in February 2024

3x throughput with H100 GPUs, 40% lower SDXL latency with TensorRT, and multimodal open source models.

Model performance

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
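Latency and throughput improvements like these are typically verified with a simple warmup-then-time loop over the pipeline. A generic sketch follows; the pipeline callables in the usage comment are hypothetical placeholders, not the post's actual benchmark code.

```python
import time

def benchmark(run_inference, warmup: int = 3, iters: int = 20):
    """Measure mean latency (s) and throughput (runs/s) of a callable."""
    for _ in range(warmup):  # warm up kernels, caches, and autotuning
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return elapsed / iters, iters / elapsed

# Usage (hypothetical): compare a baseline pipeline to a TensorRT engine.
# base_latency, base_tput = benchmark(lambda: baseline_pipeline(prompt))
# trt_latency, trt_tput = benchmark(lambda: trt_pipeline(prompt))
# print(f"latency speedup: {base_latency / trt_latency:.2f}x")
```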

Glossary

Why GPU utilization matters for model inference

Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
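The underlying arithmetic is simple: at a fixed hourly GPU price, cost per unit of work falls in direct proportion to the throughput you squeeze out of the card. A back-of-the-envelope example with hypothetical numbers:

```python
# Hypothetical numbers for illustration; plug in your own GPU price and throughput.
gpu_cost_per_hour = 4.00   # $/hr for one GPU instance
tokens_per_second = 1500   # sustained LLM throughput at current utilization

cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1_000_000
print(f"${cost_per_million:.2f} per million tokens")

# Doubling utilization (e.g. 3000 tokens/s via batching) halves the cost.
print(f"${gpu_cost_per_hour / (2 * tokens_per_second * 3600) * 1e6:.2f} per million tokens")
```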

ML models

The best open source large language model

Explore the best open source large language models of 2024 for any budget, license requirement, and use case.