July was a banner month for open-source foundation models. Llama 2 and Stable Diffusion XL redefined their genres for ease of use and output quality. But using these models in production—and achieving enough throughput in a cost-effective manner—can be a challenge. That’s why here at Baseten we’ve been focusing on features and explainers around autoscaling, cold starts, and scale-to-zero.
There aren’t enough acronyms to describe how exciting it is to build with Llama 2, the new state of the art (SOTA) in open-source large language models (OSS LLMs). The model comes in three sizes (7B, 13B, and 70B parameters). The 7B model is small enough to run on an A10 GPU, while the 70B model trades blows with GPT-3.5 on output quality. And with its 4k-token context window, it’s the best OSS model yet for chatbots and agents.
Get started with Llama 2:
Learn more about Llama 2 in this month’s Models we Love
Plus: Llama 2 on Baseten takes advantage of Truss’ new streaming output support, so you can stream the model response for a substantially lower time-to-first-token.
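To see why streaming helps, here is a minimal Python sketch that compares time-to-first-token with total generation time. The `fake_model_stream` generator is a hypothetical stand-in for a streaming model endpoint, not Truss or Baseten code, and the per-token delay is made up for illustration.

```python
import time
from typing import Iterator, List, Optional, Tuple


def fake_model_stream(tokens: List[str], delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: yields one token at a time."""
    for tok in tokens:
        time.sleep(delay_s)  # pretend each token takes delay_s seconds to generate
        yield tok


def measure_stream(stream: Iterator[str]) -> Tuple[Optional[float], float, str]:
    """Consume a token stream and return (time_to_first_token, total_time, text)."""
    start = time.perf_counter()
    first_token_s: Optional[float] = None
    pieces: List[str] = []
    for tok in stream:
        if first_token_s is None:
            # With streaming, the caller can start rendering output here.
            first_token_s = time.perf_counter() - start
        pieces.append(tok)
    total_s = time.perf_counter() - start
    return first_token_s, total_s, "".join(pieces)


if __name__ == "__main__":
    tokens = ["Llamas ", "are ", "great ", "at ", "streaming."]
    ttft, total, text = measure_stream(fake_model_stream(tokens, delay_s=0.05))
    print(f"first token after {ttft:.2f}s, full response after {total:.2f}s")
    print(text)
```

Without streaming, the user waits the full response time before seeing anything. With streaming, the wait is only the time to the first token, and the gap grows with longer responses.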
Stable Diffusion XL 1.0 (SDXL) is a larger, more powerful version of Stable Diffusion that creates high-quality images from shorter prompts. Rather than appending a string of adjectives at the end of your prompt (e.g. “4k cinematic high resolution beautiful artistic photorealism”), use SDXL and just type in exactly what you want to see.
The best way to evaluate a text-to-image model is to give it a try! We collected prompts on Twitter and generated a dozen images that show the range of the model’s capabilities and limitations.
Get started with Stable Diffusion XL 1.0:
Read a quickstart guide for SDXL deployment and invocation
Learn more about SDXL in this month’s Models we Love
Autoscaling is the process of automatically creating and deleting replicas of your machine learning model server in response to incoming traffic. Model traffic is usually inconsistent, so autoscaling helps ensure that you only pay for compute resources that you actually need.
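As a rough illustration of the idea, the sketch below picks a replica count from the number of in-flight requests. This is a simplified assumption for explanation only, not Baseten's actual autoscaling algorithm. The function name, the per-replica concurrency target, and the default replica bounds are all hypothetical.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    per_replica_concurrency: int,
    min_replicas: int = 0,   # hypothetical default; 0 allows scaling all the way down
    max_replicas: int = 10,  # hypothetical cap on replicas
) -> int:
    """Return how many replicas are needed to serve the current load."""
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    # Enough replicas so that no replica exceeds its concurrency target.
    needed = math.ceil(in_flight_requests / per_replica_concurrency)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for load in (0, 1, 9, 100):
        print(load, "in-flight requests ->", desired_replicas(load, 4), "replicas")
```

As traffic rises and falls, the replica count follows it, so you are not paying for idle servers during quiet periods.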
Autoscaling includes scale-to-zero, where a model scales down to zero replicas when not in use, meaning you pay zero dollars while the model is completely scaled down. There is a catch, though: the first request to a scaled-down model incurs a cold start while the server scales back up. But we’ve been hammering down cold start times for months to give you reliable, performant autoscaling infrastructure. For example, for Stable Diffusion on an A10G, we reliably see cold start times under 15 seconds, from zero to ready for inference.
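The trade-off comes down to simple arithmetic, sketched below. The hourly rate and usage pattern are made-up numbers, not Baseten pricing, and the model is deliberately simplified: it assumes you are billed only for active hours and ignores any time a replica stays up before scaling down.

```python
def monthly_cost_usd(
    hourly_rate_usd: float,
    active_hours_per_day: float,
    scale_to_zero: bool,
    days: int = 30,
) -> float:
    """Estimate monthly compute cost for one replica under a simplified billing model."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    # With scale-to-zero you pay only while the model is up; otherwise all 24 hours.
    billed_hours_per_day = active_hours_per_day if scale_to_zero else 24
    return hourly_rate_usd * billed_hours_per_day * days


if __name__ == "__main__":
    rate = 1.00       # hypothetical GPU price in USD per hour
    active_hours = 3  # hypothetical: model is busy 3 hours per day
    always_on = monthly_cost_usd(rate, active_hours, scale_to_zero=False)
    to_zero = monthly_cost_usd(rate, active_hours, scale_to_zero=True)
    print(f"always on: ${always_on:.2f}, scale-to-zero: ${to_zero:.2f}")
```

Under these assumed numbers, the always-on replica costs $720 a month and the scale-to-zero replica costs $90. The price of that saving is a cold start on the first request after each idle period.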
To learn more about how autoscaling works, take a look at our recent explainer on autoscaling features.
And while autoscaling features don’t kick in until the model is active, you can now avoid resource waste by stopping accidental or unwanted deployments in the Baseten UI.
We’ll be back next month with more open-source models, ML project tutorials, and infrastructure content.
— The team at Baseten