Baseten Blog

Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.

33% faster LLM inference with FP8 quantization

Quantizing Mistral 7B to FP8 resulted in near-zero perplexity gains and yielded material performance improvements across latency, throughput, and cost.

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.

GPT vs Mistral: Migrate to open source LLMs seamlessly

Use ChatCompletions API to test open-source LLMs like Mistral 7B in your AI app with just three minor code modifications.

Build your own open-source ChatGPT with Llama 2 and Chainlit

Llama 2 rivals GPT-3.5 in quality and powers ChatGPT. Chainlit helps build ChatGPT-like interfaces. This guide shows creating such interfaces with Llama 2.

Build a chatbot with Llama 2 and LangChain

Build a ChatGPT-style chatbot with open-source Llama 2 and LangChain in a Python notebook.

NVIDIA A10 vs A10G for ML model inference

The A10, an Ampere-series GPU, excels in tasks like running 7B parameter LLMs. AWS's A10G variant, similar in GPU memory & bandwidth, is mostly interchangeable.

NVIDIA A10 vs A100 GPUs for LLM and Stable Diffusion inference

This article compares two popular GPUs—the NVIDIA A10 and A100—for model inference and discusses the option of using multi-GPU instances for larger models.

Understanding NVIDIA’s Datacenter GPU line

This guide helps you navigate NVIDIA’s datacenter GPU lineup and map it to your model serving needs.

Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation

Playground v2, a new text-to-image model, matches SDXL's speed & quality with a unique AAA game-style aesthetic. Ideal choice varies by use case & art taste.

Stable Video Diffusion now available

Stability AI announced the release of Stable Video Diffusion, marking a huge leap forward for open source novel video synthesis

Open source alternatives for machine learning models

Building on top of open source models gives you access to a wide range of capabilities that you would otherwise lack from a black box endpoint provider.

The benefits of globally distributed infrastructure for model serving

Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.

Why GPU utilization matters for model inference

Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.

Introduction to quantizing ML models

Quantizing ML models like LLMs makes it possible to run big models on less expensive GPUs. But it must be done carefully to avoid quality reduction.

If You Build It, Devs will Come: How to Host an AI Meetup

Want to host an AI community meetup, but aren’t sure where to start? Julien shares his top 10 tips for successfully hosting an AI meetup.

StartupML AMA: Daniel Whitenack

Meet Daniel, Data Scientist and founding data science team member at SIL

StartupML AMA: Nikhil Harithas

Meet Nikhil, Machine Learning Engineer at Patreon and early data team expert

New in February 2024

3x throughput with H100 GPUs, 40% lower SDXL latency with TensorRT, and multimodal open source models.

New in January 2024

A library for open source models, general availability for L4 GPUs, and performance benchmarking for ML inference

New in December 2023

Faster Mixtral inference, Playground v2 image generation, and ComfyUI pipelines as API endpoints.

How we achieved SOC 2 and HIPAA compliance as an early-stage company

Baseten is a SOC 2 Type II certified and HIPAA compliant platform for fine-tuning, deploying, and serving ML models, LLMs, and AI models.

Baseten achieves SOC 2 Type II certification

Baseten, an MLOps platform for model deployment & fine-tuning, now boasts SOC 2 type 2 certification, ensuring data security, privacy, and confidentiality.

Announcing our Series A

Ending 2019, we founded Baseten to tackle challenges we faced in building ML models in various roles at big and small companies.