Baseten Blog | Page 2

May 29, 2024

Control plane vs workload plane in model serving infrastructure

A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.

Colin McGrath and 2 others

Prompt: an intricate metal mobile of our solar system

Glossary

May 9, 2024

Comparing tokens per second across LLMs

To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.

Philip Kiely

Product

May 1, 2024

New in April 2024

Use four new best-in-class LLMs, stream synthesized speech with XTTS, and deploy models with CI/CD.

Baseten

Prompt: the steps and entrance to a solarpunk museum

Hacks & projects

Apr 30, 2024

CI/CD for AI model deployments

In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.

Vlad Shulman and 3 others

Hacks & projects

Apr 18, 2024

Streaming real-time text to speech with XTTS V2

In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.

Het Trivedi and 1 other

Prompt: A wooden boat full of books floating down a rapid river in a Japanese garden

Glossary

Apr 5, 2024

Continuous vs dynamic batching for AI inference

Learn how to increase throughput with minimal impact on latency during model inference with continuous and dynamic batching.

Matt Howard and 1 other

Prompt: A batch of candy being processed on a fantasy assembly line

Product

Mar 28, 2024

New in March 2024

Fast Mistral 7B, fractional H100 GPUs, FP8 quantization, and API endpoints for model management.

Baseten

GPU guides

Mar 28, 2024

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.

Matt Howard and 3 others

Prompt: Two tron-style motorcycles racing on an empty highway

Model performance

Mar 14, 2024

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

Abu Qader and 3 others

Prompt: a model bullet train in a snowy village
