New in July 2023

July was a banner month for open-source foundation models. Llama 2 and Stable Diffusion XL raised the bar in their categories for both ease of use and output quality. But using these models in production, and achieving enough throughput cost-effectively, can be a challenge. That’s why here at Baseten we’ve been focusing on features and explainers around autoscaling, cold starts, and scale-to-zero.

Llama 2 brings new SOTA to OSS LLMs

There aren’t enough acronyms to describe how exciting Llama 2—the new state of the art (SOTA) in open-source large language models (OSS LLMs)—is to build with. The model comes in three sizes (7B, 13B, and 70B parameters). The 7B model is small enough to run on an A10, while the 70B model trades blows with GPT-3.5 on output quality. And with the model’s 4k-token context window, it’s the best OSS model yet for chatbots and agents.

Llama 2’s context window matches GPT-3.5 base at 4k tokens

Get started with Llama 2:

Plus: Llama 2 on Baseten takes advantage of Truss’ new streaming output support, so you can stream the model response for a substantially lower time-to-first-token.
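Why streaming lowers time-to-first-token: the caller can consume tokens as they are generated instead of waiting for the full completion. Here is a minimal sketch of that idea using a toy generator; the function names and per-token delay are illustrative stand-ins, not Baseten’s or Truss’ actual API.

```python
import time

def generate_tokens(prompt, n_tokens=5, delay_per_token=0.01):
    """Toy stand-in for an LLM that yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(delay_per_token)  # simulate per-token generation cost
        yield f"token{i} "

def first_token_latency(stream):
    """Time until the first token arrives: the metric streaming improves."""
    start = time.monotonic()
    first = next(stream)
    return first, time.monotonic() - start

# With streaming, the first token is available after roughly one token's
# worth of latency, rather than after the entire completion finishes.
stream = generate_tokens("Hello", n_tokens=5)
first, ttft = first_token_latency(stream)
full = first + "".join(stream)  # consume the rest as it arrives
print(first)  # token0
print(full)   # token0 token1 token2 token3 token4
```

In the non-streaming case, time-to-first-token equals time-to-full-response; with streaming, it drops to roughly the latency of a single token.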

Stable Diffusion XL 1.0: Better images, shorter prompts

Stable Diffusion XL 1.0 (SDXL) is a larger, more powerful version of Stable Diffusion that creates high-quality images from shorter prompts. Rather than appending a string of adjectives at the end of your prompt (e.g. “4k cinematic high resolution beautiful artistic photorealism”), use SDXL and just type in exactly what you want to see.

The best way to evaluate a text-to-image model is to give it a try! We took prompts on Twitter and generated a dozen images that show the range of the model’s capabilities and limitations.

Images generated with Stable Diffusion XL 1.0

Get started with Stable Diffusion XL 1.0:

Model autoscaling for cost-effective throughput

Autoscaling is the process of automatically creating and deleting replicas of your machine learning model server in response to incoming traffic. Model traffic is usually inconsistent, so autoscaling helps ensure that you only pay for compute resources that you actually need.
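The core of any autoscaling policy is a calculation like the one below: size the replica count to current traffic, clamped to configured bounds. This is a minimal sketch of the general technique; the function, parameter names, and thresholds are illustrative, not Baseten’s actual algorithm.

```python
import math

def desired_replicas(current_rps, per_replica_rps, min_replicas=0, max_replicas=10):
    """Scale replica count to match traffic, clamped to configured bounds."""
    if current_rps == 0:
        needed = 0  # no traffic: scale to zero if min_replicas allows it
    else:
        needed = math.ceil(current_rps / per_replica_rps)
    return max(min_replicas, min(needed, max_replicas))

# Suppose each replica handles ~4 requests/sec:
print(desired_replicas(30, 4))   # 8 replicas for a 30 rps spike
print(desired_replicas(0, 4))    # 0 -- scaled to zero while idle
print(desired_replicas(100, 4))  # 10 -- capped at max_replicas
```

Because the count follows traffic in both directions, idle periods cost nothing (with scale-to-zero enabled) and spikes are absorbed up to the configured maximum.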

Autoscaling includes scale-to-zero, where a model scales down to zero replicas when not in use—meaning you pay nothing while the model is completely scaled down. The catch is a cold start: the first request to a scaled-to-zero model waits while the server spins back up. But we’ve been hammering down cold start times for months to give you reliable, performant autoscaling infrastructure. For example, for Stable Diffusion on an A10G, we reliably see cold start times under 15 seconds, from zero replicas to ready for inference.
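The cost impact of scale-to-zero is easy to see with some back-of-the-envelope arithmetic: you pay for active hours instead of wall-clock hours. The hourly rate and utilization below are made-up numbers for illustration, not Baseten pricing.

```python
def monthly_cost(gpu_hourly_rate, active_hours_per_day, scale_to_zero=True, days=30):
    """Compare paying only for active hours vs. keeping a replica always on."""
    billed_hours_per_day = active_hours_per_day if scale_to_zero else 24
    return gpu_hourly_rate * billed_hours_per_day * days

rate = 1.25  # hypothetical $/hour for a single GPU
print(monthly_cost(rate, active_hours_per_day=2))                       # 75.0
print(monthly_cost(rate, active_hours_per_day=2, scale_to_zero=False))  # 900.0
```

For a model that is only active a couple of hours a day, that gap (here, $75 vs. $900 a month) is the whole argument for scale-to-zero; the cold start on the first request is the price of the savings.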

To learn more about how autoscaling works, take a look at our recent explainer on autoscaling features.

Autoscaling and scale-to-zero in action

And while autoscaling features don’t kick in until the model is active, you can now avoid resource waste by stopping accidental or unwanted deployments in the Baseten UI.

We’ll be back next month with more open-source models, ML project tutorials, and infrastructure content.

Thanks all!

— The team at Baseten