Changelog

Our latest product additions and improvements.


Mar 21, 2024

Baseten now offers model inference on NVIDIA H100mig GPUs, available for all customers starting at $0.08250/minute. 

The H100mig family of instances runs on a fractional share of an H100 GPU using NVIDIA's Multi-Instance GPU (MIG) virtualization technology. We were the first inference provider to offer H100s back in February 2024, unlocking an 18 to 45 percent improvement in price-to-performance vs. equivalent workloads using two or more A100s. With H100mig GPUs now available on Baseten, customers can take advantage of these performance and cost improvements for smaller workloads, including those currently running on a single A100 instance.

H100 pricing and instance types

Baseten currently offers one H100mig and four H100 instance types. See our instance type reference for more details.
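
As a rough illustration, here's a sketch comparing the per-minute H100mig price above with the single-A100 hourly price from our February launch post. This is illustrative arithmetic only; actual savings depend on your workload.

```python
# Rough per-minute price comparison between an H100mig instance and a
# single-A100 instance, using prices quoted in these changelog entries.
h100mig_per_minute = 0.08250        # $/min, from this post
a100_hourly = 6.15                  # $/hr, single-A100 instance (Feb 2024 post)
a100_per_minute = a100_hourly / 60  # -> $0.1025/min

print(f"A100: ${a100_per_minute:.4f}/min vs H100mig: ${h100mig_per_minute:.4f}/min")
print(f"H100mig is {(1 - h100mig_per_minute / a100_per_minute):.0%} cheaper per minute")
# ~20% cheaper per minute, before accounting for the H100's per-unit performance gains.
```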

Run your model on H100 GPUs

We’ve opened up access to H100 and H100mig GPUs for all customers and plan to aggressively scale our capacity to meet our customers’ needs. Get in touch and tell us about your use case and we’ll help you achieve big performance improvements and cost savings using H100 GPUs for model inference.

Mar 20, 2024

Get a production deployment's details

We’re excited to share that we’ve created a REST API for managing Baseten models! Unlock powerful use cases outside of the (albeit amazing) Baseten UI: interact with your models programmatically, manage them from CI jobs, and much more.

The following endpoints are currently available:

  • Get and upsert secrets: /v1/secrets

  • Get model information: /v1/models

  • Get deployment information: /v1/models/{model_id}/deployments/{id or env}

  • Update autoscaling settings: /v1/models/{model_id}/deployments/{id or env}/autoscaling_settings

  • Promote deployments: /v1/models/{model_id}/deployments/{id or env}/promote

  • Activate deployments: /v1/models/{model_id}/deployments/{id or env}/activate

  • Deactivate deployments: /v1/models/{model_id}/deployments/{id or env}/deactivate

And more endpoints are on the way! Check out our REST API documentation, and happy curling!
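
For a concrete starting point, here's a minimal Python sketch of calling two of these endpoints with the requests library. The API key header format, environment variable, and placeholder model ID are assumptions; the REST API documentation is the authoritative reference.

```python
# Minimal sketch of calling the REST API, assuming an API key in the
# BASETEN_API_KEY environment variable and the Api-Key header format.
import os
import requests

BASE_URL = "https://api.baseten.co"
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

# List the models in your workspace
models = requests.get(f"{BASE_URL}/v1/models", headers=headers)
models.raise_for_status()
print(models.json())

# Get details for a model's production deployment ({id or env} accepts an environment name)
model_id = "YOUR_MODEL_ID"  # placeholder
deployment = requests.get(
    f"{BASE_URL}/v1/models/{model_id}/deployments/production", headers=headers
)
deployment.raise_for_status()
print(deployment.json())
```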

Mar 7, 2024

Every deployment of an ML model requires certain hardware resources — usually a GPU plus CPU cores and RAM — to run inference. We’ve made it easier to navigate the wide variety of hardware options available on Baseten with a new instance type selection page.

On this page, you can select GPU type (or no GPU at all), GPU count, and associated CPU resources. To help guide you through these options, we have docs on available instance types and a series of blog posts on GPU options. Don’t hesitate to reach out to support@baseten.co if you have any questions about which hardware resources are best for your model inference workload.


Feb 23, 2024

Baseten billing dashboard with daily model usage graph

You can now view a daily breakdown of your model usage and billing information to get more insight into usage and costs. Here are the key changes:

  • A new graph displays daily costs, requests, and billable minutes. You can use the filter in the top left corner to view this information for a specific model or model deployment.

  • Billing and usage information is now available for both the current and previous billing period.

  • Request count is now visible from the model usage table.

We’re here to help with any billing and usage questions at support@baseten.co.

Feb 6, 2024

Baseten is now offering model inference on H100 GPUs starting at $9.984/hour. Switching to H100s offers an 18 to 45 percent improvement in price-to-performance vs. equivalent A100 workloads using TensorRT and TensorRT-LLM.

H100 stats

We’re using SXM H100s, which feature:

  • 989.5 teraFLOPS of fp16 tensor compute (vs. 312 teraFLOPS for the 80 GB SXM A100)

  • 80 GB of VRAM (matching the 80 GB SXM A100)

  • 3.35 TB/s of memory bandwidth (vs. 2.039 TB/s for the 80 GB SXM A100)

Most critically for LLM inference, the H100 offers 64% higher memory bandwidth, though the speedup in compute also helps for compute-bound tasks like prefill (which means much faster time to first token).

Replacing A100 workloads with H100

An instance with a single H100 costs 62% more ($9.984/hr) than a single A100 instance ($6.15/hr). Looking only at the spec sheet, with a 64% increase in memory bandwidth for a 62% higher price, you wouldn’t expect much improvement in performance per dollar.

However, thanks to TensorRT, you can save 18 to 45 percent on inference costs for workloads that use two or more A100-based instances by switching to H100s.

The H100 offers more than just increased memory bandwidth and higher core counts. TensorRT optimizes models to run on the H100’s new Hopper architecture, which unlocks additional performance:

  • Running Mistral 7B, we observed approximately 2x higher tokens per second and 2-3x lower prefill time across all batch sizes

  • Running Stable Diffusion XL, we observed approximately 2x lower total generation time across all step counts

With twice the performance at only a 62% higher price, switching to H100 offers 18% savings vs. A100, with better latency. But if you increase concurrency on the H100 until latency reaches A100 benchmarks, you can get as much as three times the throughput — a 45% savings on high-volume workloads.
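
The arithmetic behind those savings figures, as a quick sketch using the hourly prices in this post (actual savings depend on your workload):

```python
# Rough cost-per-unit-of-throughput comparison using the prices in this post.
a100_hourly = 6.15    # $/hr, single-A100 instance
h100_hourly = 9.984   # $/hr, single-H100 instance

# Relative throughput observed with TensorRT/TensorRT-LLM (see benchmarks above)
same_latency_speedup = 2.0    # ~2x performance at comparable latency
max_throughput_speedup = 3.0  # ~3x throughput when latency is allowed to match A100

def savings(speedup):
    # Cost per unit of work = hourly price / throughput; compare H100 to A100.
    relative_cost = (h100_hourly / speedup) / a100_hourly
    return 1 - relative_cost

print(f"{savings(same_latency_speedup):.0%} savings at 2x performance")   # ~19%
print(f"{savings(max_throughput_speedup):.0%} savings at 3x throughput")  # ~46%
```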

H100 pricing and instance types

An instance with a single H100 GPU costs $9.984/hour. Instances are available with 2, 4, and 8 H100 GPUs for running larger models that require extra VRAM; pricing scales linearly with GPU count.

Run your model on H100 GPUs

We’ve opened up access to our first batch of H100 GPUs, and plan to aggressively scale our capacity. To enable H100 access in your Baseten account, get in touch and tell us about your use case and we’ll help you achieve substantial cost savings and performance improvements by switching to H100 GPUs for model inference.

Jan 19, 2024

Model library demo video

We’ve totally refreshed our model library to make it easier for you to find, evaluate, deploy, and build on state-of-the-art open source ML models. You can try the new model library for yourself today and deploy models like Mistral 7B, Stable Diffusion XL, and Whisper V3.

Here’s what’s new with the model library:

  • The main library page is filterable and searchable by model name, use case, publisher, and other tags.

  • Each model page features code samples with best practices for calling the deployed model.

  • A new deployment page lets you name your model and confirm instance type and autoscaling settings before deployment.

The model library has a new home at www.baseten.co/library — previously, it was at app.baseten.co/explore. If you had the old link bookmarked, no worries, you’ll be automatically redirected to the new pages.

We have big plans for the model library moving forward, and we’re excited that this new platform makes it easier than ever to share the best open source models. We’ve added new models to the library with this release, including our optimized TensorRT-LLM implementation of Mixtral 8x7B, two popular text embedding models, and Stable Diffusion XL with ControlNet (plus Canny and Depth variants).
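
As an illustration, here's a minimal sketch of calling a library model once it's deployed. The endpoint URL format, input schema, and environment variable are assumptions; the code sample on each model's library page is the authoritative reference.

```python
# Hypothetical sketch of calling a deployed model library model with `requests`.
# The endpoint URL and input schema are assumptions; use the code sample on the
# model's library page as the authoritative reference.
import os
import requests

model_id = "YOUR_MODEL_ID"  # placeholder: found on the model's overview page
api_key = os.environ["BASETEN_API_KEY"]  # assumed env var holding your API key

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"prompt": "What is the capital of France?"},  # input schema varies by model
)
print(resp.json())
```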

Jan 11, 2024

You can now deploy models to instances powered by the L4 GPU on Baseten. NVIDIA’s L4 GPU is an Ada Lovelace series GPU with:

  • 121 teraFLOPS of float16 compute

  • 24 GB of VRAM at a 300 GB/s memory bandwidth

While the L4 is the next-gen successor to the T4, it’s natural to compare it instead to the A10 as both have 24 GB of VRAM. However, the two are better suited for different workloads.

Thanks to its high compute power, the L4 is great for:

  • Image generation models like Stable Diffusion XL

  • Batch jobs of Whisper and other transcription tasks

  • Any compute-bound model inference tasks

However, due to lower memory bandwidth, the L4 is not well suited for:

  • Most LLM inference tasks, like running Mistral 7B or Llama 7B for chat

  • Most autoregressive model inference

  • Any memory-bound model inference tasks
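
One back-of-the-envelope way to see the split, as a sketch using the specs listed above (real workloads are messier than a single ratio):

```python
# Rough arithmetic-intensity threshold for the L4, using the specs above.
# Workloads that perform fewer FLOPs per byte of memory traffic than this
# ratio are roughly memory-bound; those above it are roughly compute-bound.
l4_fp16_flops = 121e12      # 121 teraFLOPS of float16 compute
l4_mem_bandwidth = 300e9    # 300 GB/s memory bandwidth

threshold = l4_fp16_flops / l4_mem_bandwidth
print(f"~{threshold:.0f} FLOPs per byte")  # ~403
# Autoregressive LLM decoding streams every weight per generated token with
# little math per byte, so it falls well below this line (memory-bound), while
# image generation like Stable Diffusion XL does far more math per byte and
# can take advantage of the L4's compute (compute-bound).
```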

L4-based instances start at $0.8484/hour — about 70% of the cost of an A10G-based instance. See our instance type reference for the full list of L4 instance types and prices.

If you have any questions about using L4 GPUs for model inference, please let us know at support@baseten.co.

Jan 8, 2024


When deploying with Truss via truss push, you can now assign meaningful names to your deployments using the --deployment-name argument, making them easier to identify and manage. Here's an example: truss push --publish --deployment-name my-new-deployment.

By default, published deployments are named deployment-n, where n increments as you publish new deployments. Development and Production deployments are still labeled accordingly alongside their new name.

To name or rename a deployment, select "Rename deployment" in the deployment's action menu.

Dec 15, 2023


Autoscaling lets your deployed models handle variable traffic while making efficient use of model resources. We’ve updated some language and default settings to make using autoscaling more intuitive. The updated default values only apply to newly created model deployments.

New autoscaler setting language

We renamed two of the autoscaler’s three settings to better reflect their purpose:

  • Autoscaling window, previously called autoscaling delay, is the timeframe of traffic considered for scaling replicas up and down.

  • Scale down delay, previously called additional scale down delay, is the additional time the autoscaler waits before spinning down a replica.

  • Concurrency target, which is unchanged, is the number of concurrent requests you want each replica to be responsible for handling.

New default values

We also assigned new default values for autoscaling window and scale down delay. These values will only apply to new model deployments.

  • The default value for the autoscaling window is now 60 seconds, where previously it was 1200 seconds (20 minutes).

  • The default value for the scale down delay is now 900 seconds (15 minutes), where previously it was 0 seconds.

  • The default value for concurrency target is unchanged at 1 request.

We made this change because, while autoscaler settings aren't universal, we generally recommend a shorter autoscaling window with a longer scale down delay to respond quickly to traffic spikes while maintaining capacity through variable traffic.
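
To make the interaction concrete, here is a simplified, illustrative sketch of how these settings fit together. It is not the actual autoscaler implementation, and the replica math is a deliberate simplification.

```python
import math

# Illustrative sketch only: not the actual Baseten autoscaler implementation.
autoscaling_window_s = 60   # traffic window considered for scaling decisions (new default)
scale_down_delay_s = 900    # extra wait before spinning down a replica (new default)
concurrency_target = 1      # concurrent requests each replica should handle (unchanged default)

def desired_replicas(concurrent_requests_in_window: float) -> int:
    # Scale so each replica handles roughly `concurrency_target` concurrent requests.
    return max(1, math.ceil(concurrent_requests_in_window / concurrency_target))

# A spike to 5 concurrent requests within the 60-second window scales up promptly:
print(desired_replicas(5))  # -> 5 replicas
# When traffic drops back down, replicas are only removed after the additional
# 900-second scale down delay, keeping capacity through variable traffic.
```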

Additionally, the scale down delay setting is now available for all deployments, where previously it was limited to deployments with a max replica count of two or greater.

For more on autoscaler settings, see our guide to autoscaling in the Baseten docs.

Nov 10, 2023

You can now retry failed model builds and deploys directly from the model dashboard in your Baseten workspace.

Model builds and deploys can fail due to temporary issues, like a network error while downloading model weights or Python dependencies. In these cases, simply retrying the process can fix the issue.

To retry a failed build or deploy, click the “Retry build” or “Retry deploy” button in the deployment info on the model dashboard.
