Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
Get started in minutes. Avoid getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.
Learn more about model deployment

truss init --example stable-diffusion-2-1-base ./my-sd-truss
cd ./my-sd-truss
export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
truss push
Serializing Stable Diffusion 2.1 truss.
Making contact with Baseten 👋 👽
🚀 Uploading model to Baseten 🚀
Upload progress: 0% | | 0.00G/2.39G
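Once the upload finishes, the deployed model sits behind an HTTP endpoint. A minimal sketch of calling it from Python; the model ID in the URL is a placeholder, and the exact invocation URL for your deployment is shown in the Baseten dashboard:

import os
import requests

# Placeholder model ID; copy the real invocation URL from your model's dashboard
resp = requests.post(
    "https://model-YOUR_MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "a watercolor painting of a lighthouse at dawn"},
)
resp.raise_for_status()
print(resp.json())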
Open-source model packaging. Meet Truss, a seamless bridge from model development to model delivery. Truss presents an open-source standard for packaging models built in any framework for sharing and deployment in any environment, local or production.
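Running truss init scaffolds a small, self-contained package. A simplified sketch of the layout (exact contents may vary by Truss version):

my-sd-truss/
├── config.yaml        # dependencies, hardware resources, and model metadata
└── model/
    └── model.py       # implements the model's load() and predict() methods

The Llama 2 example below shows what such a model.py looks like.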

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, TextIteratorStreamer
from threading import Thread
import torch

class Model:
    def __init__(self, **kwargs):
        # Secrets (e.g. the Hugging Face access token) are injected by Baseten at runtime
        self._secrets = kwargs["secrets"]
        self._model = None
        self._tokenizer = None

    def load(self):
        # Called once when the model server starts: download weights and load them onto GPUs
        self._model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            device_map="auto",
            torch_dtype=torch.float16,
        )
        self._tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
        )

    def predict(self, model_input):
        # Called for every inference request
        prompt = model_input.pop("prompt")
        stream = model_input.pop("stream", False)
        return self.forward(prompt, stream, **model_input)
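The imports of TextIteratorStreamer and Thread hint at the streaming path, but the forward method that predict calls is omitted above. A minimal sketch of what it could look like, assuming the standard Hugging Face token-streaming pattern rather than Baseten's exact implementation:

    def forward(self, prompt, stream, **generation_kwargs):
        # Hypothetical continuation of the Model class above, not Baseten's exact code
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        if stream:
            # generate() runs on a background thread and pushes decoded text chunks
            # into the streamer; returning the iterator lets the server stream tokens
            streamer = TextIteratorStreamer(
                self._tokenizer, skip_prompt=True, skip_special_tokens=True
            )
            Thread(
                target=self._model.generate,
                kwargs={**inputs, "streamer": streamer, **generation_kwargs},
            ).start()
            return streamer
        output_ids = self._model.generate(**inputs, **generation_kwargs)
        return self._tokenizer.decode(output_ids[0], skip_special_tokens=True)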
Highly performant infra that scales with you. We've built Baseten as a horizontally scalable service that takes you from prototype to production. As your traffic increases, our infrastructure automatically scales to keep up with it; there's no extra configuration required.
Learn more about autoscaling
Faster and better
We've optimized every step of the pipeline — building images, starting containers, caching models, provisioning resources, and fetching weights — to ensure models scale up from zero to ready for inference as quickly as possible.
Scale-up from zero: 05:00 without Baseten, 00:09 with Baseten.
Logs and health metrics. We've built Baseten to serve production-grade traffic to real users. We provide reliable logging and monitoring for every model deployed to ensure there's visibility into what's happening under the hood at every step.
Learn more about logs and metrics
Resource management. Customize the infrastructure running your model. We provide access to the latest and greatest infrastructure to run your models on. It's easy to configure, and pricing is transparent.
Learn more about customization
Deploy on your own infrastructure or use AWS or GCP credits
Use your AWS or GCP credits on Baseten
Are you a startup with AWS and GCP credits? Use them on Baseten.
Deploy on your own infrastructure
Are you an enterprise that wants to self-host Baseten, or utilize compute across multiple clouds? Baseten is easily deployable inside any modern cloud. Your models and data don't need to leave your VPC.
Patreon saves nearly $600k/year in ML resources with Baseten
With Baseten, Patreon deployed and scaled the open-source foundation model Whisper at record speed without hiring an in-house ML infra team.
Laurel ships ML models 9+ months faster using Baseten
To automatically categorize hundreds of thousands of time entries every day, Laurel leverages sophisticated ML models and Baseten’s product suite.