Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.

Get started in minutes without getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.

Learn more about model deployment

$ truss init --example stable-diffusion-2-1-base ./my-sd-truss
$ cd ./my-sd-truss
$ export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
$ truss push
INFO Serializing Stable Diffusion 2.1 truss.
INFO Making contact with Baseten 👋 👽
INFO 🚀 Uploading model to Baseten 🚀
Upload progress: 0% | | 0.00G/2.39G

Open-source model packaging. Meet Truss, a seamless bridge from model development to model delivery. Truss presents an open-source standard for packaging models built in any framework for sharing and deployment in any environment, local or production.

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, TextIteratorStreamer
from threading import Thread
import torch

class Model:
    def __init__(self, **kwargs):
        # Truss injects secrets (like the Hugging Face access token) at runtime
        self._secrets = kwargs["secrets"]
        self._model = None
        self._tokenizer = None

    def load(self):
        # Load Llama 2 70B Chat in float16, spread across the available GPUs
        self._model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            device_map="auto",
            torch_dtype=torch.float16,
        )
        self._tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-70b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
        )

    def predict(self, model_input):
        prompt = model_input.pop("prompt")
        stream = model_input.pop("stream", False)
        return self.forward(prompt, stream, **model_input)
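The snippet above is cut off by the page before the forward helper it calls. A plausible continuation, sketched here on the assumption that it streams tokens with the TextIteratorStreamer imported at the top (an illustration, not Baseten's exact code):

    def forward(self, prompt, stream, **generation_kwargs):
        # Hypothetical continuation of the Model class above
        generation_kwargs.setdefault("max_new_tokens", 512)
        # Tokenize the prompt and move it to the model's device
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        if stream:
            # Generate on a background thread and yield decoded tokens as they arrive
            streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
            Thread(target=self._model.generate, kwargs={**inputs, "streamer": streamer, **generation_kwargs}).start()
            return streamer
        # Non-streaming path: generate the full completion and decode it in one go
        output_ids = self._model.generate(**inputs, **generation_kwargs)
        return self._tokenizer.decode(output_ids[0], skip_special_tokens=True)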
Production deployment #6wgzg4q
Active on 2xA10G, last deployed 2 days ago
Replicas: 2 of 4 active | Scale down delay: --
Inference (last hour): 233 calls | Response time (median): 832 ms
Configure auto-scaling | View metrics
Invoke Llama 2

import baseten

baseten.login("API_KEY_HERE")
model = baseten.deployed_model_id('1234abcd')
output = model.predict({"prompt": "some prompt"})
print(output)

Highly performant infra that scales with you. We've built Baseten as a horizontally scalable service that takes you from prototype to production. As your traffic increases, our infrastructure automatically scales to keep up with it; there's no extra configuration required.

Learn more about autoscaling
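To make the scale-with-traffic idea concrete, here is a toy sketch of deriving a replica count from in-flight requests and a per-replica concurrency target. It is purely illustrative and not Baseten's actual autoscaler; the parameter names and defaults are assumptions.

import math

def desired_replicas(in_flight_requests, target_concurrency=4, min_replicas=0, max_replicas=4):
    # Scale to zero when idle; otherwise provision enough replicas so each
    # handles at most `target_concurrency` requests, capped at max_replicas.
    if in_flight_requests == 0:
        return min_replicas
    needed = math.ceil(in_flight_requests / target_concurrency)
    return max(min_replicas, min(max_replicas, needed))

# e.g. 9 in-flight requests at 4 per replica -> 3 replicas
print(desired_replicas(9))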
β€Œ

Faster and better

We've optimized every step of the pipeline (building images, starting containers, caching models, provisioning resources, and fetching weights) to ensure models scale up from zero to ready for inference as quickly as possible.

Running SDXL 1.0 on an NVIDIA A10G, cold start time: without Baseten, 05:00; with Baseten, 00:09.
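As an illustration of why weight caching matters for those numbers, here is a minimal sketch (not Baseten's internals) that resolves weights from a pre-warmed Hugging Face cache so a cold replica can skip the multi-gigabyte download; the cache path is a made-up example.

import time
from huggingface_hub import snapshot_download

CACHE_DIR = "/cache/huggingface"  # hypothetical pre-warmed cache volume

def fetch_weights(repo_id: str) -> str:
    # If the snapshot already exists under CACHE_DIR, this resolves locally and
    # returns almost instantly; otherwise it falls back to downloading.
    start = time.time()
    path = snapshot_download(repo_id, cache_dir=CACHE_DIR)
    print(f"Weights for {repo_id} ready in {time.time() - start:.1f}s")
    return path

fetch_weights("stabilityai/stable-diffusion-xl-base-1.0")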

Logs and health metrics. We've built Baseten to serve production-grade traffic to real users. We provide reliable logging and monitoring for every model deployed to ensure there's visibility into what's happening under the hood at every step.

Learn more about logs and metrics
Sep 18 8:33:08am  Build was successful. Deploy task is starting.
Sep 18 8:33:08am  Configuring Resources to match user provided values
Sep 18 8:33:08am  Requesting 7400 millicpus
Sep 18 8:33:08am  Requesting 28300 MiB of memory
Sep 18 8:33:08am  Requesting 1 GPUs
Sep 18 8:33:09am  Creating the Baseten Inference Service.
Sep 18 8:33:09am  Waiting for model service to spin up. This might take a minute.
Sep 18 8:33:13am  starting uvicorn with 1 workers
Sep 18 8:33:13am  Executing model.load()...
Sep 18 8:33:13am  Application startup complete.
Sep 18 8:33:14am  Created a temporary directory at /tmp/tmpfjq60a0e
Sep 18 8:33:14am  Writing /tmp/tmpfjq60a0e/_remote_module_non_scriptable.py
Sep 18 8:33:16am  Fetching 17 files:  12%|██        | 2/17 [00:00<00:01, 7.81it/s]
Sep 18 8:33:16am  Downloading (…)tokenizer/merges.txt: 525kB [00:00, 8.44MB/s]
Sep 18 8:33:16am  Fetching 17 files:  41%|████      | 7/17 [00:00<00:00, 19.08it/s]
Sep 18 8:33:16am  Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 7.19MB/s]
Sep 18 8:33:16am  Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 25.90it/s]
Sep 18 8:33:17am  [Coldboost] starting uvicorn with 1 workers
Sep 18 8:33:19am  [Coldboost] Writing /tmp/tmp4lra5yau/_remote_module_non_scriptable.py
Sep 18 8:33:21am  Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]
Sep 18 8:33:21am  Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 7.85MB/s]
Sep 18 8:33:21am  Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 27.66it/s]
Sep 18 8:33:22am  Loading pipeline components...: 100%|██████████| 5/5 [00:00<00:00, 8.07it/s]
Sep 18 8:33:23am  Completed model.load() execution in 10167 ms
Sep 18 8:33:26am  [Coldboost] Fetching 17 files:  12%|██        | 2/17 [00:00<00:01, 13.25it/s]
Sep 18 8:33:26am  [Coldboost] Downloading (…)tokenizer/merges.txt: 525kB [00:00, 23.3MB/s]
Sep 18 8:33:26am  [Coldboost] Fetching 17 files:  41%|████      | 7/17 [00:00<00:00, 20.70it/s]
Sep 18 8:33:27am  [Coldboost] Downloading (…)kenizer_2/merges.txt: 525kB [00:00, 17.4MB/s]
Sep 18 8:33:27am  [Coldboost] Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 31.14it/s]
Sep 18 8:33:27am  Deploy was a success.
Listening for new logs...

Resource management. Customize the infrastructure running your model. We provide access to the latest and greatest hardware to run your models on. It's easy to configure, and pricing is transparent.

Learn more about customization
Select an instance type
T4x4x16: 1 T4 GPU (16 GiB VRAM), 4 vCPUs, 16 GiB RAM, $0.01052/min
L4x4x16: 1 L4 GPU (24 GiB VRAM), 4 vCPUs, 16 GiB RAM, $0.01414/min
A10Gx4x16: 1 A10G GPU (24 GiB VRAM), 4 vCPUs, 16 GiB RAM, $0.02012/min
A100x12x144: 1 A100 GPU (80 GiB VRAM), 12 vCPUs, 144 GiB RAM, $0.10240/min
H100x26x234: 1 H100 GPU (80 GiB VRAM), 26 vCPUs, 234 GiB RAM, $0.16640/min
1x2: 1 vCPU, 2 GiB RAM, $0.00058/min
1x4: 1 vCPU, 4 GiB RAM, $0.0008/min
2x8: 2 vCPUs, 8 GiB RAM, $0.00173/min
4x16: 4 vCPUs, 16 GiB RAM, $0.00346/min
8x32: 8 vCPUs, 32 GiB RAM, $0.00691/min
16x64: 16 vCPUs, 64 GiB RAM, $0.01382/min
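As a quick illustration of how you might reason about that list, here is a toy helper (not part of Baseten's API) that picks the cheapest GPU instance with enough VRAM for a given model; the data is copied from the table above.

# Toy example, not a Baseten API: (name, VRAM in GiB, price per minute)
GPU_INSTANCES = [
    ("T4x4x16", 16, 0.01052),
    ("L4x4x16", 24, 0.01414),
    ("A10Gx4x16", 24, 0.02012),
    ("A100x12x144", 80, 0.10240),
    ("H100x26x234", 80, 0.16640),
]

def cheapest_fit(required_vram_gib: float) -> str:
    # Keep only instances whose GPU memory covers the requirement, then take the cheapest
    candidates = [i for i in GPU_INSTANCES if i[1] >= required_vram_gib]
    if not candidates:
        raise ValueError("No single-GPU instance has enough VRAM")
    name, _, price = min(candidates, key=lambda i: i[2])
    return f"{name} (${price}/min)"

# A model that needs about 20 GiB of VRAM would land on an L4 here
print(cheapest_fit(20))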

Deploy on your own infrastructure or use AWS or GCP credits

COMING SOON
Use your AWS or GCP credits on Baseten

Are you a startup with AWS and GCP credits? Use them on Baseten.

FOR ENTERPRISE
Deploy on your own infrastructure

Are you an enterprise that wants to self-host Baseten, or utilize compute across multiple clouds? Baseten is easily deployable inside any modern cloud. Your models and data don't need to leave your VPC.