Machine learning infrastructure that just works 

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.

Get started in minutes. Avoid getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.

Learn more about model deployment
baseten.co

$ truss init --example stable-diffusion-2-1-base ./my-sd-truss
$ cd ./my-sd-truss
$ export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
$ truss push
INFO Serializing Stable Diffusion 2.1 truss.
INFO Making contact with Baseten 👋 👽
INFO 🚀 Uploading model to Baseten 🚀
Upload progress: 0% |                         | 0.00G/2.39G

Open-source model packaging. Meet Truss, a seamless bridge from model development to model delivery. Truss presents an open-source standard for packaging models built in any framework for sharing and deployment in any environment, local or production.

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, TextIteratorStreamer
from threading import Thread
import torch

class Model:
    def __init__(self, **kwargs):
        # Secrets (like the Hugging Face access token) are injected by Baseten at runtime
        self._secrets = kwargs["secrets"]
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once per replica, before any traffic is served
        self._model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            device_map="auto",
            torch_dtype=torch.float16,
        )
        self._tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
        )

    def predict(self, model_input):
        prompt = model_input.pop("prompt")
        stream = model_input.pop("stream", False)
        # Any remaining keys in model_input pass through as generation arguments
        return self.forward(prompt, stream, **model_input)
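The preview above ends at predict(); the forward() method it calls is not shown. Purely as a hedged sketch of what a streaming-capable forward() might look like, built from the imports already present (GenerationConfig, TextIteratorStreamer, Thread) rather than from Baseten's actual example code:

    # Hypothetical continuation inside the Model class above (not part of the original snippet)
    def forward(self, prompt, stream, **generation_args):
        # Tokenize the prompt and move it to the model's device
        input_ids = self._tokenizer(prompt, return_tensors="pt").input_ids.to(self._model.device)
        generation_config = GenerationConfig(**generation_args)

        if stream:
            # Hand decoded tokens back as they are generated, on a background thread
            streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
            Thread(
                target=self._model.generate,
                kwargs=dict(input_ids=input_ids, generation_config=generation_config, streamer=streamer),
            ).start()
            return streamer

        output_ids = self._model.generate(input_ids, generation_config=generation_config)
        return self._tokenizer.decode(output_ids[0], skip_special_tokens=True)

With a forward() along these lines, predict({"prompt": "...", "stream": True}) would return an iterator of text chunks, while the non-streaming path returns the full decoded completion.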
A deployed model's status card shows Production deployment #6wgzg4q: active on 2x A10G, last deployed 2 days ago, 2 of 4 replicas active, scale-down delay not set, 233 inference calls in the last hour, 832 ms median response time, with links to configure auto-scaling and view metrics.
Invoke Llama 2
import baseten

baseten.login("API_KEY_HERE")
model = baseten.deployed_model_id('1234abcd')
output = model.predict({"prompt": "some prompt"})
print(output)
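Deployed models can also be called over plain HTTPS. The endpoint path and auth header below are assumptions for illustration only; copy the real invocation URL and API key from the model's page in your Baseten workspace:

import requests

# MODEL_ID, the endpoint path, and the auth scheme are placeholders; take the real
# values from the model's dashboard rather than from this sketch.
response = requests.post(
    "https://app.baseten.co/models/MODEL_ID/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "some prompt"},
)
print(response.json())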

Highly performant infra that scales with you. We've built Baseten as a horizontally scalable service that takes you from prototype to production. As your traffic increases, our infrastructure automatically scales to keep up with it; there's no extra configuration required.

Learn more about autoscaling
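Baseten exposes the knobs visible in the status card above (a replica range and a scale-down delay) rather than asking you to run an autoscaler yourself. The exact policy isn't spelled out on this page, but the general shape of concurrency-based replica scaling can be sketched like this; all names and numbers here are illustrative, not Baseten's API:

import math

def desired_replicas(in_flight_requests, target_concurrency, min_replicas, max_replicas):
    # Illustrative policy: enough replicas to keep per-replica concurrency near the target,
    # clamped to the configured replica range.
    needed = math.ceil(in_flight_requests / target_concurrency) if in_flight_requests else min_replicas
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(9, 4, min_replicas=0, max_replicas=4))  # 3 replicas for 9 in-flight requests
print(desired_replicas(0, 4, min_replicas=0, max_replicas=4))  # scales to zero when idle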

Faster and better 

We've optimized every step of the pipeline (building images, starting containers and caching models, provisioning resources, and fetching weights) so that models scale up from zero to ready for inference as quickly as possible.

Running SDXL 1.0 on an NVIDIA A10G: 5:00 to ready without Baseten, 0:09 with Baseten.
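One of the levers listed above is caching model weights so that a replica scaling up from zero doesn't have to re-download them. Baseten's own machinery isn't shown on this page; the sketch below just illustrates the general pattern with huggingface_hub, where a pre-warmed cache directory (a hypothetical path here) turns the download into a near-instant local lookup:

from huggingface_hub import snapshot_download

# If /cache/huggingface has been pre-populated (baked into the image or mounted from
# fast local storage), this resolves to local files instead of pulling gigabytes of weights.
weights_path = snapshot_download(
    "stabilityai/stable-diffusion-xl-base-1.0",
    cache_dir="/cache/huggingface",  # hypothetical pre-warmed cache location
)
print(weights_path)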

Logs and health metrics. We've built Baseten to serve production-grade traffic to real users. We provide reliable logging and monitoring for every model deployed to ensure there's visibility into what's happening under the hood at every step.

Learn more about logs and metrics
Sep 18 8:33:08am  Build was successful. Deploy task is starting.
Sep 18 8:33:08am  Configuring Resources to match user provided values
Sep 18 8:33:08am  Requesting 7400 millicpus
Sep 18 8:33:08am  Requesting 28300 MiB of memory
Sep 18 8:33:08am  Requesting 1 GPUs
Sep 18 8:33:09am  Creating the Baseten Inference Service.
Sep 18 8:33:09am  Waiting for model service to spin up. This might take a minute.
Sep 18 8:33:13am  starting uvicorn with 1 workers
Sep 18 8:33:13am  Executing model.load()...
Sep 18 8:33:13am  Application startup complete.
Sep 18 8:33:14am  Created a temporary directory at /tmp/tmpfjq60a0e
Sep 18 8:33:14am  Writing /tmp/tmpfjq60a0e/_remote_module_non_scriptable.py
Sep 18 8:33:16am  Fetching 17 files: 12%|██████████| 2/17 [00:00<00:01, 7.81it/s]
Sep 18 8:33:16am  Downloading (…)tokenizer/merges.txt: 525kB [00:00, 8.44MB/s]
Sep 18 8:33:16am  Fetching 17 files: 41%|██████████| 7/17 [00:00<00:00, 19.08it/s]
Sep 18 8:33:16am  Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 7.19MB/s]
Sep 18 8:33:16am  Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 25.90it/s]
Sep 18 8:33:17am  [Coldboost] starting uvicorn with 1 workers
Sep 18 8:33:19am  [Coldboost] Writing /tmp/tmp4lra5yau/_remote_module_non_scriptable.py
Sep 18 8:33:21am  Fetching 9 files: 0%|██████████| 0/9 [00:00<?, ?it/s]
Sep 18 8:33:21am  Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 7.85MB/s]
Sep 18 8:33:21am  Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 27.66it/s]
Sep 18 8:33:22am  Loading pipeline components...: 100%|██████████| 5/5 [00:00<00:00, 8.07it/s]
Sep 18 8:33:23am  Completed model.load() execution in 10167 ms
Sep 18 8:33:26am  [Coldboost] Fetching 17 files: 12%|██████████| 2/17 [00:00<00:01, 13.25it/s]
Sep 18 8:33:26am  [Coldboost] Downloading (…)tokenizer/merges.txt: 525kB [00:00, 23.3MB/s]
Sep 18 8:33:26am  [Coldboost] Fetching 17 files: 41%|██████████| 7/17 [00:00<00:00, 20.70it/s]
Sep 18 8:33:27am  [Coldboost] Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 17.4MB/s]
Sep 18 8:33:27am  [Coldboost] Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 31.14it/s]
Sep 18 8:33:27am  Deploy was a success.
Listening for new logs...
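The stream above mixes Baseten's deploy logs with output from the model itself; anything your model code prints or logs during load() and predict() should land in the same place (that routing is an assumption of this sketch, but it is the usual behavior for containerized serving):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class Model:
    def load(self):
        logger.info("Loading weights...")  # shows up alongside the platform's own log lines

    def predict(self, model_input):
        logger.info("Handling request with keys: %s", list(model_input))
        return model_input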

Resource management. Customize the infrastructure running your model. We provide access to the latest and greatest hardware for your models; it's easy to configure, and pricing is transparent.

Learn more about customization
Select an instance type:
T4x4x16: 1 T4 GPU (16 GiB VRAM), 4 vCPUs, 16 GiB RAM, $0.01052/min
T4x16x64: 1 T4 GPU (16 GiB VRAM), 16 vCPUs, 64 GiB RAM, $0.02408/min
A10Gx4x16: 1 A10G GPU (24 GiB VRAM), 4 vCPUs, 16 GiB RAM, $0.02012/min
A10Gx16x64: 1 A10G GPU (24 GiB VRAM), 16 vCPUs, 64 GiB RAM, $0.03248/min
A10G:2x24x96: 2 A10G GPUs (48 GiB VRAM), 24 vCPUs, 96 GiB RAM, $0.05672/min
A100x12x144: 1 A100 GPU (80 GiB VRAM), 12 vCPUs, 144 GiB RAM, $0.10240/min
1x2: 1 vCPU, 2 GiB RAM, $0.00058/min
1x4: 1 vCPU, 4 GiB RAM, $0.00080/min
2x8: 2 vCPUs, 8 GiB RAM, $0.00173/min
4x16: 4 vCPUs, 16 GiB RAM, $0.00346/min
8x32: 8 vCPUs, 32 GiB RAM, $0.00691/min
16x64: 16 vCPUs, 64 GiB RAM, $0.01382/min
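To put the per-minute pricing in concrete terms: a single A10Gx4x16 instance at $0.02012/min comes to roughly $1.21 per hour, or about $29 per day if it never scales to zero, which is why the replica minimums and scale-down behavior above matter for cost.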

Deploy on your own infrastructure or use AWS or GCP credits 

COMING SOON
Use your AWS or GCP credits on Baseten

Are you a startup with AWS or GCP credits? Use them on Baseten.

FOR ENTERPRISE
Deploy on your own infrastructure

Are you an enterprise that wants to self-host Baseten, or utilize compute across multiple clouds? Baseten is easily deployable inside any modern cloud. Your models and data don't need to leave your VPC.

Case study

Patreon saves nearly $600k/year in ML resources with Baseten

With Baseten, Patreon deployed and scaled the open-source foundation model Whisper at record speed without hiring an in-house ML infra team.

Case study

Laurel ships ML models 9+ months faster using Baseten

To automatically categorize hundreds of thousands of time entries every day, Laurel leverages sophisticated ML models and Baseten’s product suite.