Our latest product additions and improvements.


Feb 23, 2024

Baseten billing dashboard with daily model usage graph

You can now view a daily breakdown of your model usage and billing information to get more insight into usage and costs. Here are the key changes:

  • A new graph displays daily costs, requests, and billable minutes. You can use the filter in the top left corner to view this information for a specific model or model deployment.

  • Billing and usage information is now available for both the current and previous billing period.

  • Request count is now visible from the model usage table.

We’re here to help with any billing and usage questions.

Feb 6, 2024

Baseten is now offering model inference on H100 GPUs starting at $9.984/hour. Switching to H100s offers an 18 to 45 percent improvement in price-to-performance versus equivalent A100 workloads using TensorRT and TensorRT-LLM.

H100 stats

We’re using SXM H100s, which feature:

  • 989.5 teraFLOPS of fp16 tensor compute (vs 312 for the 80 GB SXM A100)

  • 80 GB of VRAM (matching 80GB SXM A100)

  • 3.35 TB/s memory bandwidth (vs 2.039 TB/s for the 80 GB SXM A100)

Most critically for LLM inference, the H100 offers 64% higher memory bandwidth, though the speedup in compute also helps for compute-bound tasks like prefill (which means much faster time to first token).
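
The headline gains can be sanity-checked directly from the spec sheet numbers above:

```python
# Back-of-envelope comparison from the published H100/A100 specs (illustrative only).
h100_bw_tbs, a100_bw_tbs = 3.35, 2.039              # memory bandwidth, TB/s
h100_fp16_tflops, a100_fp16_tflops = 989.5, 312.0   # fp16 tensor compute

bandwidth_gain = h100_bw_tbs / a100_bw_tbs - 1      # fraction higher
compute_ratio = h100_fp16_tflops / a100_fp16_tflops # speedup factor

print(f"Memory bandwidth: {bandwidth_gain:.0%} higher")  # ~64% higher
print(f"fp16 compute: {compute_ratio:.1f}x")             # ~3.2x
```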

Replacing A100 workloads with H100

An instance with a single H100 costs 62% more ($9.984/hr) than a single A100 instance ($6.15/hr). Just by looking at the stat sheet, a 64% increase in memory bandwidth for a 62% higher price suggests roughly break-even performance per dollar.

However, thanks to TensorRT, you can save 18 to 45 percent on inference costs for workloads that use two or more A100-based instances by switching to H100s.

The H100 offers more than just increased memory bandwidth and higher core counts. TensorRT optimizes models to run on the H100’s new Hopper architecture, which unlocks additional performance:

  • Running Mistral 7B, we observed approximately 2x higher tokens per second and 2-3x lower prefill time across all batch sizes

  • Running Stable Diffusion XL, we observed approximately 2x lower total generation time across all step counts

With twice the performance at only a 62% higher price, switching to H100s offers 18% savings vs A100, with better latency. But if you increase concurrency on the H100 until latency reaches A100 benchmarks, you can get as much as three times the throughput — a 45% savings on high-volume workloads.
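
The savings figures follow directly from the price and throughput ratios; a quick sketch of the arithmetic:

```python
# Price-performance arithmetic behind the 18% and 45% savings figures.
h100_price, a100_price = 9.984, 6.15   # $/hr per single-GPU instance
price_ratio = h100_price / a100_price  # ~1.62x more expensive

# Cost per unit of work = price ratio / throughput ratio.
savings_at_2x = 1 - price_ratio / 2    # ~0.19 -> roughly 18-19% savings
savings_at_3x = 1 - price_ratio / 3    # ~0.46 -> roughly 45% savings
```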

H100 pricing and instance types

An instance with a single H100 GPU costs $9.984/hour. Instances are available with 2, 4, and 8 H100 GPUs for running larger models that require extra VRAM; pricing scales linearly with GPU count.
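
Since pricing scales linearly with GPU count, the multi-GPU instance prices work out as:

```python
# Hourly instance price, assuming strictly linear scaling with GPU count.
H100_HOURLY = 9.984
prices = {gpus: round(gpus * H100_HOURLY, 3) for gpus in (1, 2, 4, 8)}
# {1: 9.984, 2: 19.968, 4: 39.936, 8: 79.872}
```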

Run your model on H100 GPUs

We’ve opened up access to our first batch of H100 GPUs and plan to aggressively scale our capacity. To enable H100 access in your Baseten account, get in touch and tell us about your use case. We’ll help you achieve substantial cost savings and performance improvements by switching to H100 GPUs for model inference.

Jan 19, 2024

Model library demo video

We’ve totally refreshed our model library to make it easier for you to find, evaluate, deploy, and build on state-of-the-art open source ML models. You can try the new model library for yourself today and deploy models like Mistral 7B, Stable Diffusion XL, and Whisper V3.

Here’s what’s new with the model library:

  • The main library page is filterable and searchable by model name, use case, publisher, and other tags.

  • Each model page features code samples with best practices for calling the deployed model.

  • A new deployment page lets you name your model and confirm instance type and autoscaling settings before deployment.

The model library has moved to a new home. If you had the old link bookmarked, no worries: you’ll be automatically redirected to the new pages.

We have big plans for the model library moving forward, and we’re excited that this new platform makes it easier than ever to share the best open source models. We’ve added new models to the library with this release, including our optimized TensorRT-LLM implementation of Mixtral 8x7B, two popular text embedding models, and Stable Diffusion XL with ControlNet (plus Canny and Depth variants).

Jan 11, 2024

You can now deploy models to instances powered by the L4 GPU on Baseten. NVIDIA’s L4 GPU is an Ada Lovelace series GPU with:

  • 121 teraFLOPS of float16 compute

  • 24 GB of VRAM at a 300 GB/s memory bandwidth

While the L4 is the next-gen successor to the T4, it’s natural to compare it instead to the A10 as both have 24GB of VRAM. However, the two are better suited for different workloads.

Thanks to its high compute power, the L4 is great for:

  • Image generation models like Stable Diffusion XL

  • Batch jobs of Whisper and other transcription tasks

  • Any compute-bound model inference tasks

However, due to lower memory bandwidth, the L4 is not well suited for:

  • Most LLM inference tasks, like running Mistral 7B or Llama 7B for chat

  • Most autoregressive model inference

  • Any memory-bound model inference tasks
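
One way to see why LLM chat workloads are a poor fit: autoregressive decoding must stream the full model weights from VRAM for every generated token, so memory bandwidth caps single-stream throughput. A rough, illustrative upper bound (ignoring KV-cache traffic and batching):

```python
# Illustrative ceiling on single-stream decode speed for a 7B fp16 model on an L4,
# assuming every generated token reads all weights from VRAM once.
params = 7e9
bytes_per_param = 2                 # fp16
l4_bandwidth = 300e9                # bytes/s (L4 memory bandwidth)

weight_bytes = params * bytes_per_param         # 14 GB of weights
max_tokens_per_s = l4_bandwidth / weight_bytes  # ~21 tokens/s ceiling
```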

L4-based instances start at $0.8484/hour — about 70% of the cost of an A10G-based instance.

If you have any questions about using L4 GPUs for model inference, please let us know.

Jan 8, 2024


When deploying with Truss via truss push, you can now assign meaningful names to your deployments using the --deployment-name argument, making them easier to identify and manage. Here's an example: truss push --publish --deployment-name my-new-deployment.

By default, published deployments are named deployment-n, where n increments as you publish new deployments. Development and production deployments are still labeled accordingly alongside their new names.

To name or rename a deployment, select "Rename deployment" in the deployment's action menu.

Dec 15, 2023

Autoscaling lets your deployed models handle variable traffic while making efficient use of model resources. We’ve updated some language and default settings to make using autoscaling more intuitive. The updated default values only apply to newly created model deployments.

New autoscaler setting language

We renamed two of the autoscaler’s three settings to better reflect their purpose:

  • Autoscaling window, previously called autoscaling delay, is the timeframe of traffic considered for scaling replicas up and down.

  • Scale down delay, previously called additional scale down delay, is the additional time the autoscaler waits before spinning down a replica.

  • Concurrency target, which is unchanged, is the number of concurrent requests you want each replica to be responsible for handling.

New default values

We also assigned new default values for autoscaling window and scale down delay. These values will only apply to new model deployments.

  • The default value for the autoscaling window is now 60 seconds, where previously it was 1200 seconds (20 minutes).

  • The default value for the scale down delay is now 900 seconds (15 minutes), where previously it was 0 seconds.

  • The default value for concurrency target is unchanged at 1 request.

We made this change because, while autoscaler settings aren't universal, we generally recommend a shorter autoscaling window with a longer scale down delay to respond quickly to traffic spikes while maintaining capacity through variable traffic.

Additionally, the scale down delay setting is now available for all deployments, where previously it was limited to deployments with a max replica count of two or greater.

For more on autoscaler settings, see our guide to autoscaling in the Baseten docs.

Nov 10, 2023

You can now retry failed model builds and deploys directly from the model dashboard in your Baseten workspace.

Model builds and deploys can fail due to temporary issues, like a network error while downloading model weights or Python dependencies. In these cases, simply retrying the process can fix the issue.

To retry a failed build or deploy, click the “Retry build” or “Retry deploy” button in the deployment info on the model dashboard.

Oct 31, 2023

We've made some big changes to the model management experience to clarify the model lifecycle and better follow concepts you're already familiar with as a developer. These changes aren't breaking – they'll just make it easier for you to deploy and serve your models performantly, scalably, and cost-effectively.

If you’ve already deployed models on Baseten and know your way around the old model overview pages, you’ll notice some changes including: 

  • Deployments: Draft and primary versions are now development and production deployments.

  • Calling your model: You can now easily test your model in Baseten by calling it from the overview page.

  • New predict endpoints: We've added new predict endpoints with a simpler response format. You will need to change the way you parse your model output if you decide to switch to the new endpoints.

  • Observability and control: The model overview page includes additional model metadata and new actions you can take on each deployment.


We’ve moved away from semantic versioning in favor of deployments with deployment IDs. 

A development deployment in the process of being promoted to production

The new model overview page highlights two special deployments:

  • Development deployment (formerly draft version): This deployment is designed for quick iteration and testing, with live reload so you can patch changes onto the model server while it runs. It’s limited to a maximum of one replica and always scales to zero after inactivity.

  • Production deployment (formerly primary version): Promote your development deployment to production when you’re ready to use your model for a production use case. The production deployment and all other published deployments have access to full autoscaling to meet the demands of high and variable traffic.

The deployments section of the model overview page

All of your deployments are listed beneath the development and production deployments:

  • For workspaces with multiple users, you now get visibility into who in your workspace created each model deployment.

  • You can set different autoscaling settings for each deployment and get at-a-glance visibility into how many replicas are scaled up at a given time.

  • We’ve added new actions to model deployments. In addition to activating, deactivating, and deleting deployments, click into the action menu to:

    • Wake the deployment if it’s scaled to zero.

    • Download the deployment’s Truss.

    • Stop an in-progress deployment.

    • Manage the deployment’s autoscaling settings.

Calling your model

The call model modal where you can find deployment endpoints and call the model

The predict endpoint format has changed so that you can call the current development or production deployment without referencing a specific deployment ID. The old endpoints will continue working so you can continue using them if you’d like. Here’s how the new endpoints are formatted: 

  • To call the current production deployment: https://model-<model-id>

  • To call the current development deployment: https://model-<model-id>

  • To call another deployment: https://model-<model-id><deployment-id>/predict

With the new model endpoints, we’ve also changed the output format of the model response. New responses are no longer wrapped in a JSON dictionary, which removes a step in parsing model output. You only need to change the way you parse your model output if you switch to these new endpoints.
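
If you switch, the only client-side change is dropping the unwrapping step. A minimal sketch, assuming for illustration that old responses were wrapped under a `model_output` key (check your existing client code for the actual key your model returns):

```python
import json

# Hypothetical response bodies for illustration; the "model_output" wrapper
# key is an assumption, not the documented format.
old_response = json.loads('{"model_output": {"text": "hello"}}')
new_response = json.loads('{"text": "hello"}')

old_result = old_response["model_output"]  # old endpoints: unwrap first
new_result = new_response                  # new endpoints: use directly

assert old_result == new_result
```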

Call a model deployment from within Baseten

We've also added new functionality to the "Call model" dialog:

  • You can generate an API key in one click instead of going to your account settings.

  • You can now test your model by calling it from the model overview page within Baseten. Model library models even come with sample inputs you can call right away.

Observability and control

You’ll notice a handful of other changes as you explore your model overview pages:

  • Metrics have moved to their own tab. The overview tab still provides high-level visibility into the number of calls and median response time in the last hour. Click over to the metrics tab to dive deeper into model traffic, performance, and GPU usage.

  • When choosing an instance type you can choose to view pricing per minute, per hour, or per day.

  • You can quickly see a running total of how much you’ve spent on the model this billing period and then drill deeper into usage costs by deployment and instance type.

The model metrics tab showing end-to-end response time in the last hour

For more on model management features on Baseten, check out our all-new docs.

That’s it for now. We’re eager to know what you think, so please reach out to us with questions and feedback on these changes.

Oct 27, 2023

We added workspace API keys to give you more control over how you call models, especially in production environments.

There are now two types of API keys on Baseten:

  • Personal keys are tied to your Baseten account. This is the type of API key that previously existed and these keys are unchanged — they can still be used to deploy, call, and manage models.

  • Workspace keys are shared across your entire Baseten workspace. When you create a workspace API key, you can grant it full access to the workspace or limit it to only being able to call and wake selected models.

Every action taken with a personal API key is associated with the matching user account. Use personal API keys for deploying and testing models, and use workspace API keys for automated actions and production environments.

Workspace and personal API keys

Oct 16, 2023

The model IDs for some models deployed on Baseten have been changed.

This is not a breaking change. All existing model invocations using the old model IDs will continue to be supported. You do not need to take any action as a result of this change. If you wish, you may use the new model IDs for future model invocations.

This change makes model IDs case-insensitive and was made in preparation for upcoming improvements to model performance and scalability.