Our latest product additions and improvements.


Sep 7, 2023

Models deployed on Baseten using Truss 0.7.1 or later can now send the 500 response code when there is an error during model invocation. This change only affects newly deployed models.

Any exception raised will result in a 500 response code. For example, this Truss code:

class Model:
    def predict(self, model_input):
        # Any exception raised during invocation results in a 500 response
        raise Exception("hello")

Will yield a response with the following content:

    "error": "Internal Server Error"

For details on a given error, check the model logs for the exception that was raised.
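On the client side, the new status code can be checked before the response body is used. The sketch below is illustrative only: `parse_model_response` is a hypothetical helper, and it assumes the error body is a JSON object with an `error` key, as the snippet above suggests.

```python
import json


def parse_model_response(status_code: int, body: str) -> dict:
    """Return the parsed model output, or raise if the server reported an error.

    Hypothetical helper: assumes the response body is a JSON object and that
    a failed invocation carries an "error" key.
    """
    payload = json.loads(body)
    if status_code == 500:
        # The body only says "Internal Server Error"; the actual exception
        # is recorded in the model logs.
        message = payload.get("error", "unknown error")
        raise RuntimeError(f"Model invocation failed: {message}")
    return payload
```

Pass in the status code and raw body from whichever HTTP client you use. A 500 then surfaces as an exception in your application instead of being mistaken for model output.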

Sep 7, 2023

We’ve updated your API key management panel with four key changes:

  1. Dropped randomly generated key names (e.g. legit-artichoke)

  2. Instead, displayed the first 8 characters of each key to make it easy to identify

  3. Added created field to key list

  4. Added last used field to key list

These changes make it easier to follow best practices for key management, especially for revoking unused or unwanted keys.

API key management

Aug 24, 2023

Every model deployed on Baseten uses Truss, our open-source framework for packaging models with their dependencies, hardware requirements, and more. 

You can now securely download the Truss of any model deployed on your Baseten account. Just click the “Download Truss” button. Then, use Truss’ live reload developer loop to iterate on your model before publishing an updated version to production!

Download the Truss of any model in your Baseten workspace

Aug 22, 2023

A small but mighty change: a 🚫 icon now clearly marks inactive models in the model dropdown and model version sidebar. The green dot for active models and moon icon for scaled-to-zero models remain unchanged. Telling active vs inactive models apart is essential—it could be a difference of hundreds of dollars a day—so this visual tweak makes that information available at a glance.

An active, inactive, and scaled-to-zero model

Aug 14, 2023

Get cleaner, more accurate insights into your model’s performance and load with the refreshed model metrics charts in each model’s overview tab.

Monitor requests per minute and both mean and peak response times.

Inference volume chart and response time tab

And cross-reference that demand with essential autoscaling metrics like CPU and GPU usage, plus replica count.

GPU usage chart and GPU memory usage tab

Aug 7, 2023

Baseten now supports streaming model output. Instead of having to wait for the entire output to be generated, you can immediately start returning results to users with a sub-one-second time-to-first-token.

Streaming Llama-2 output

Streaming is supported for all model outputs, but it’s particularly useful for large language model (LLM) output for three reasons:

  1. Partial outputs are actually useful for LLMs. They allow you to quickly get a good sense of what kind of response you’re getting. You can tell a lot from the first few words of output.

  2. LLMs are often used in chat and other applications that are latency-sensitive. Users expect a snappy UX when interacting with an LLM.

  3. Waiting for complete responses takes a long time, and the longer the LLM’s output, the longer the wait.

With streaming, the output starts almost immediately, every time.

For more on streaming with Baseten, check out an example LLM with streaming output and a sample project built with Llama 2 and Chainlit.
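As a rough sketch of what a streaming model can look like, the Truss-style class below returns a generator from `predict` rather than a finished string. It assumes that returning a generator is how output gets streamed. The `prompt` input key is hypothetical, and the word-by-word loop stands in for a real LLM producing tokens.

```python
class Model:
    def predict(self, model_input):
        # Hypothetical input key "prompt"; a real model would generate tokens.
        words = model_input["prompt"].split()

        def stream():
            # Yield each piece as soon as it is ready instead of
            # building the full output first.
            for word in words:
                yield word + " "

        return stream()


# Consuming the generator locally; a client would receive these chunks
# incrementally over the wire.
chunks = list(Model().predict({"prompt": "hello streaming world"}))
```

Each yielded chunk can be shown to the user immediately, which is what keeps time-to-first-token low even when the full output is long.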

Jul 22, 2023

If you deploy a model by accident or want to shut down a failed deployment quickly, you can now stop any model deployment before it finishes on the model page.

Stopping a model deployment puts that version in an inactive state. You can then restart the deployment by activating the version. Or, if the version is not needed, you can delete it.

Stop deployment from the model version action menu

Jul 15, 2023

Baseten’s global nav has a new layout that focuses on deployed models and makes it easier to get to your workspace settings and API keys.

Baseten global nav

Jul 1, 2023

Base per-minute CPU and GPU pricing is now 40% lower across all instance types, with volume discounts available on our Pro plan. We started getting a better deal from compute providers and thought those savings should get passed on to you.

For example, you can now serve a model on an A10G for just $1.207/hour, compared to our old price of $2.012/hour. Combined with scale-to-zero, configurable autoscaling, and faster cold starts, these instance prices represent substantial cost savings on deploying and serving ML models.
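The A10G figures above follow directly from the 40% reduction. Here is a quick check of the arithmetic, using only the prices quoted in this note (the variable names are illustrative):

```python
OLD_A10G_HOURLY = 2.012  # previous A10G price in USD per hour
PRICE_CUT = 0.40         # 40% reduction in base pricing

# Apply the reduction and round to the precision used on this page.
new_a10g_hourly = round(OLD_A10G_HOURLY * (1 - PRICE_CUT), 3)

# Billing is per minute, so the per-minute rate is the hourly rate over 60.
new_a10g_per_minute = new_a10g_hourly / 60

print(f"New A10G price: ${new_a10g_hourly}/hour")
```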

For more details, see the Baseten pricing page.

Jun 16, 2023

The logs tab on your model pages got a serious glow-up: ANSI formatting, line-by-line links, a copy button in each log. Oh, and logs load much, much faster.

Unified build, deployment, and invocation logs are served live. You can switch to a time range view to hunt down a specific problem, and filter logs to only show warnings or errors.

Model deployment logs