Baseten Blog

Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.

News

Jun 27, 2024

Introducing Baseten Chains

Learn about Baseten's new Chains framework for deploying complex ML inference workflows across compound AI systems using multiple models and components.

Bola Malek and 4 others

Model performance

Jul 23, 2024

How to serve 10,000 fine-tuned LLMs from a single GPU

LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.

Pankaj Gupta and 1 other

Mar 14, 2024

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

Abu Qader and 3 others

Mar 14, 2024

33% faster LLM inference with FP8 quantization

Quantizing Mistral 7B to FP8 resulted in a near-zero increase in perplexity and yielded material improvements in latency, throughput, and cost.

Pankaj Gupta and 1 other

Mar 12, 2024

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

Justin Yi and 1 other

Hacks & projects

Jul 25, 2024

Deploying custom ComfyUI workflows as APIs

Easily package your ComfyUI workflow to use any custom node or model checkpoint.

Het Trivedi and 1 other

Apr 30, 2024

CI/CD for AI model deployments

In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.

Vlad Shulman and 3 others

Apr 18, 2024

Streaming real-time text to speech with XTTS V2

In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and 200 ms time to first chunk.

Het Trivedi and 1 other

Dec 8, 2023

How to serve your ComfyUI model behind an API endpoint

This guide details how to deploy ComfyUI image generation pipelines behind an API for app integration, using Truss for packaging and production deployment.

Het Trivedi and 1 other

GPU guides

Mar 28, 2024

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPU (MIG) enables splitting a single H100 GPU across two model serving instances, with performance that matches or beats an A100 GPU at 20% lower cost.

Matt Howard and 3 others

Nov 28, 2023

NVIDIA A10 vs A10G for ML model inference

The A10, an Ampere-series GPU, excels at tasks like running 7B-parameter LLMs. AWS's A10G variant, similar in GPU memory and bandwidth, is mostly interchangeable.

Philip Kiely

Sep 15, 2023

NVIDIA A10 vs A100 GPUs for LLM and Stable Diffusion inference

This article compares two popular GPUs—the NVIDIA A10 and A100—for model inference and discusses the option of using multi-GPU instances for larger models.

Philip Kiely

May 23, 2023

Understanding NVIDIA’s Datacenter GPU line

This guide helps you navigate NVIDIA’s datacenter GPU lineup and map it to your model serving needs.

Philip Kiely

ML models

Feb 9, 2024 (revised Jul 26, 2024)

The best open source large language model

Explore the best open source large language models for 2024 for any budget, license, and use case.

Philip Kiely

Dec 13, 2023

Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation

Playground v2, a new text-to-image model, matches SDXL's speed and quality with a distinctive AAA-game-style aesthetic. The ideal choice depends on your use case and artistic taste.

Philip Kiely

Nov 22, 2023

Stable Video Diffusion now available

Stability AI announced the release of Stable Video Diffusion, marking a huge leap forward for open source novel video synthesis.

Sid Shanker and 1 other

Nov 21, 2023

Open source alternatives for machine learning models

Building on top of open source models gives you access to a wide range of capabilities that you would otherwise lack from a black box endpoint provider.

Varun Shenoy and 1 other

Glossary

Jun 14, 2024

Comparing few-step image generation models

Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but they trade off speed against quality.

Rachel Rapp

Jun 4, 2024

How latent consistency models work

Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.

Rachel Rapp

May 29, 2024

Control plane vs workload plane in model serving infrastructure

A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.

Colin McGrath and 2 others

May 9, 2024

Comparing tokens per second across LLMs

To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.

Philip Kiely

Community

Jul 25, 2024

Ten reasons to join Baseten

Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.

Dustin Michaels and 1 other

May 31, 2024

What I learned as a forward-deployed engineer working at an AI startup

My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.

Het Trivedi

Jun 12, 2023

What I learned from my AI startup’s internal hackathon

See hackathon projects from Baseten for ML infrastructure, inference, user experience, and streaming.

Julien Reiman

Apr 6, 2023

If You Build It, Devs Will Come: How to Host an AI Meetup

Want to host an AI community meetup, but aren’t sure where to start? Julien shares his top 10 tips for successfully hosting an AI meetup.

Julien Reiman

Product

Jul 11, 2024

Using Asynchronous Inference in Production

Learn how async inference works, how it protects against common inference failures, where it applies in common use cases, and more.

Samiksha Pal and 2 others

Jul 2, 2024

Baseten Chains Explained: Building Multi-Component AI Workflows at Scale

A Delightful Developer Experience for Building and Deploying Compound ML Inference Workflows

Marius Killinger and 1 other

Jun 3, 2024

New in May 2024

AI events, multicluster model serving architecture, tokenizer efficiency, and forward-deployed engineering

Baseten

May 1, 2024

New in April 2024

Use four new best-in-class LLMs, stream synthesized speech with XTTS, and deploy models with CI/CD

Baseten

News

Mar 4, 2024

Announcing our Series B

We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.

Tuhin Srivastava

Baseten co-founders Amir, Tuhin, Phil, and Pankaj

Mar 28, 2023

Baseten announces HIPAA compliance

Baseten is a HIPAA-compliant MLOps platform for fine-tuning, deploying, and monitoring ML models on secure model infrastructure.

Baseten

Mar 8, 2023

How we achieved SOC 2 and HIPAA compliance as an early-stage company

Baseten is a SOC 2 Type II-certified and HIPAA-compliant platform for fine-tuning, deploying, and serving ML models, LLMs, and AI models.

Baseten

Mar 8, 2023

Baseten achieves SOC 2 Type II certification

Baseten, an MLOps platform for model deployment and fine-tuning, has achieved SOC 2 Type II certification, ensuring data security, privacy, and confidentiality.

Baseten