Our latest product additions and improvements.


Aug 22, 2023

A small but mighty change: a 🚫 icon now clearly marks inactive models in the model dropdown and model version sidebar. The green dot for active models and moon icon for scaled-to-zero models remain unchanged. Telling active vs inactive models apart is essential—it could be a difference of hundreds of dollars a day—so this visual tweak makes that information available at a glance.

An active, inactive, and scaled-to-zero model

Aug 14, 2023

Get cleaner, more accurate insights into your model’s performance and load with the refreshed model metrics charts in each model’s overview tab.

Monitor requests per minute and both mean and peak response times.

Inference volume chart and response time tab

Then cross-reference that demand with essential autoscaling metrics like CPU and GPU usage, plus replica count.

GPU usage chart and GPU memory usage tab

Aug 7, 2023

Baseten now supports streaming model output. Instead of having to wait for the entire output to be generated, you can immediately start returning results to users with a sub-one-second time-to-first-token.

Streaming Llama-2 output

Streaming is supported for all model outputs, but it’s particularly useful for large language model (LLM) output for three reasons:

  1. Partial outputs are actually useful for LLMs. They let you quickly get a good sense of what kind of response you’re getting. You can tell a lot from the first few words of output.

  2. LLMs are often used in chat and other latency-sensitive applications. Users expect a snappy UX when interacting with an LLM.

  3. Waiting for complete responses takes a long time, and the longer the LLM’s output, the longer the wait.

With streaming, the output starts almost immediately, every time.

For more on streaming with Baseten, check out an example LLM with streaming output and a sample project built with Llama 2 and Chainlit.
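The latency win is easy to see with a toy generator standing in for an LLM. This sketch makes no Baseten API calls; the token list and per-token delay are made up purely to illustrate why the first token arrives so much sooner than the full response:

```python
import time

def generate_tokens():
    # Stand-in for an LLM emitting tokens one at a time;
    # each token takes some time to generate.
    for token in ["Streaming", " lets", " you", " show", " output", " early", "."]:
        time.sleep(0.05)  # simulated per-token generation latency
        yield token

# Without streaming: wait for the whole response before showing anything.
start = time.monotonic()
full_response = "".join(generate_tokens())
full_latency = time.monotonic() - start

# With streaming: the first token is usable almost immediately.
start = time.monotonic()
stream = generate_tokens()
first_token = next(stream)
time_to_first_token = time.monotonic() - start
rest = "".join(stream)

assert first_token + rest == full_response  # same content either way
assert time_to_first_token < full_latency   # but it starts arriving far sooner
```

The longer the output, the bigger the gap between `time_to_first_token` and `full_latency`, which is exactly why streaming matters most for long LLM generations.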

Jul 22, 2023

If you deploy a model by accident or want to shut down a failed deployment quickly, you can now stop any in-progress model deployment directly from the model page.

Stopping a model deployment puts that version in an inactive state. You can then restart the deployment by activating the version. Or, if the version is not needed, you can delete it.

Stop deployment from the model version action menu

Jul 15, 2023

Baseten’s global nav has a new layout that focuses on deployed models and makes it easier to get to your workspace settings and API keys.

Baseten global nav

Jul 1, 2023

Base per-minute CPU and GPU pricing is now 40% lower across all instance types, with volume discounts available on our Pro plan. We started getting a better deal from compute providers and thought those savings should get passed on to you.

For example, you can now serve a model on an A10G for just $1.207/hour, compared to our old price of $2.012/hour. Combined with scale-to-zero, configurable autoscaling, and faster cold starts, these instance prices represent substantial cost savings on deploying and serving ML models.
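As a quick sanity check on the A10G numbers (just arithmetic, no Baseten pricing API involved):

```python
old_price = 2.012  # A10G, $/hour before the change
discount = 0.40    # headline 40% reduction

new_price = round(old_price * (1 - discount), 3)
print(new_price)  # 1.207
```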

For more details, see the Baseten pricing page.

Jun 16, 2023

The logs tab on your model pages got a serious glow-up: ANSI formatting, line-by-line links, and a copy button for each log entry. Oh, and logs load much, much faster.

Unified build, deployment, and invocation logs are served live. You can switch to a time range view to hunt down a specific problem, and filter logs to only show warnings or errors.

Model deployment logs

Jun 1, 2023

One of the slowest parts of deploying a model—whether for the first time or as a cold start for a scaled-to-zero model service—is downloading the model weights. These files can exceed 10 GB for some common foundation models.

We developed a network accelerator to speed up model loads from common model artifact stores, including HuggingFace, CloudFront, S3, and OpenAI. Our accelerator employs byte range downloads in the background to maximize the parallelism of downloads.

The network accelerator speeds up downloading large model files by 500-600%.

The network accelerator uses a proxy and sidecar to speed up model downloads. These services run on our SOC 2 Type II certified and HIPAA compliant infrastructure. If you prefer to disable network acceleration for your Baseten workspace, contact our support team and we will turn it off for you.
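The core idea behind byte-range parallelism can be sketched in a few lines. This is not Baseten's implementation; `fetch_range` stands in for an HTTP GET with a `Range: bytes=start-end` header against an artifact store, and the chunk size is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size, chunk_size):
    """Split [0, total_size) into inclusive (start, end) pairs,
    as used in an HTTP 'Range: bytes=start-end' header."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

def fetch_range(blob, start, end):
    # Stand-in for a ranged HTTP GET; a real client would hit the
    # artifact store (Hugging Face, S3, CloudFront, ...) here.
    return blob[start : end + 1]

def parallel_download(blob, chunk_size=1024, workers=8):
    ranges = byte_ranges(len(blob), chunk_size)
    # Executor.map preserves input order, so the chunks
    # reassemble into the original file.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(blob, *r), ranges)
    return b"".join(chunks)

weights = bytes(range(256)) * 64  # stand-in for a model weights file
assert parallel_download(weights) == weights
```

With real network I/O, the threads spend most of their time waiting on sockets, so issuing many ranged requests concurrently keeps the link saturated in a way a single sequential download cannot.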

May 26, 2023

Deploy the latest open-source models like WizardLM, Alpaca, Bark, Whisper, Stable Diffusion, and more from the refreshed and restocked model library.

Baseten's model library offers quick deployments of open-source foundation models

Previously, model library models deployed to your account used a shared instance of the model. This meant you couldn’t adjust resource configurations or view logs and metrics. Now, models from the model library are deployed directly to instances in your workspace, giving you full access to Baseten's model management features.

Existing shared instance model deployments will continue to operate as before, but all new deployments from the model library will use the standard deployment method.

For more on deploying and managing model library models, see the updated documentation.

May 2, 2023

The billing page in your workspace settings has two new capabilities: a model usage dashboard and invoice history panel.

The model usage dashboard breaks down your bill by model

Your model usage dashboard breaks down the billable time and total cost of each active model in your workspace. Usage is tracked by version for models with multiple versions, including deleted versions.

If your account has credits, they will be applied against your bill automatically and shown in the model usage dashboard.

The invoices panel tracks prior invoices

For previous billing periods, use the invoices panel to track and download prior invoices. Total bills shown are net of free credits.