Baseten Blog
Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.
Introducing Baseten Chains
Learn about Baseten's new Chains framework for deploying complex ML inference workflows across compound AI systems using multiple models and components.
Model performance
How to serve 10,000 fine-tuned LLMs from a single GPU
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
33% faster LLM inference with FP8 quantization
Quantizing Mistral 7B to FP8 caused a near-zero increase in perplexity and yielded material improvements in latency, throughput, and cost.
High performance ML inference with NVIDIA TensorRT
Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
Hacks & projects
Deploying custom ComfyUI workflows as APIs
Easily package your ComfyUI workflow to use any custom node or model checkpoint.
CI/CD for AI model deployments
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
Streaming real-time text to speech with XTTS V2
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text to speech model with real-time narration and 200 ms time to first chunk.
How to serve your ComfyUI model behind an API endpoint
This guide details deploying ComfyUI image generation pipelines via API for app integration, using Truss for packaging & production deployment.
GPU guides
Using fractional H100 GPUs for efficient model serving
Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
NVIDIA A10 vs A10G for ML model inference
The A10, an Ampere-series GPU, excels in tasks like running 7B parameter LLMs. AWS's A10G variant, similar in GPU memory & bandwidth, is mostly interchangeable.
NVIDIA A10 vs A100 GPUs for LLM and Stable Diffusion inference
This article compares two popular GPUs—the NVIDIA A10 and A100—for model inference and discusses the option of using multi-GPU instances for larger models.
Understanding NVIDIA’s Datacenter GPU line
This guide helps you navigate NVIDIA’s datacenter GPU lineup and map it to your model serving needs.
ML models
The best open source large language model
Explore the best open source large language models for 2024 for any budget, license, and use case.
Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation
Playground v2, a new text-to-image model, matches SDXL's speed & quality with a unique AAA game-style aesthetic. Ideal choice varies by use case & art taste.
Stable Video Diffusion now available
Stability AI announced the release of Stable Video Diffusion, marking a huge leap forward for open source novel video synthesis.
Open source alternatives for machine learning models
Building on top of open source models gives you access to a wide range of capabilities that you would otherwise lack from a black box endpoint provider.
Glossary
Comparing few-step image generation models
Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but there's a tradeoff when it comes to speed vs quality.
How latent consistency models work
Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.
Control plane vs workload plane in model serving infrastructure
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
Comparing tokens per second across LLMs
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
Community
Ten reasons to join Baseten
Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.
What I learned as a forward-deployed engineer working at an AI startup
My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.
What I learned from my AI startup’s internal hackathon
See hackathon projects from Baseten for ML infrastructure, inference, user experience, and streaming.
If You Build It, Devs will Come: How to Host an AI Meetup
Want to host an AI community meetup, but aren’t sure where to start? Julien shares his top 10 tips for successfully hosting an AI meetup.
Product
Using Asynchronous Inference in Production
Learn how async inference works, protects against common inference failures, is applied in common use cases, and more.
Baseten Chains Explained: Building Multi-Component AI Workflows at Scale
A Delightful Developer Experience for Building and Deploying Compound ML Inference Workflows
New in May 2024
AI events, multicluster model serving architecture, tokenizer efficiency, and forward-deployed engineering
New in April 2024
Use four new best-in-class LLMs, stream synthesized speech with XTTS, and deploy models with CI/CD
News
Announcing our Series B
We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.
Baseten announces HIPAA compliance
Baseten is a HIPAA-compliant MLOps platform for fine-tuning, deploying, and monitoring ML models on secure model infrastructure.
How we achieved SOC 2 and HIPAA compliance as an early-stage company
Baseten is a SOC 2 Type II certified and HIPAA compliant platform for fine-tuning, deploying, and serving ML models, LLMs, and AI models.
Baseten achieves SOC 2 Type II certification
Baseten, an MLOps platform for model deployment & fine-tuning, now boasts SOC 2 Type II certification, ensuring data security, privacy, and confidentiality.