Autoscaling is the ability of your machine learning model to automatically spawn more replicas or terminate replicas in response to the amount of incoming traffic. Because model traffic is rarely consistent, autoscaling helps ensure you’re only paying for the computing resources that you’re actually using. This post will walk through the Baseten model resources and autoscaling features, using the Falcon-7B model as an example.
Here’s what you’ll need to follow along in this tutorial:
A Baseten account, which you can sign up for by providing your email address. You’ll be given $30 in credits upon sign-up, and we don’t require a credit card to get started.
A model to work with. I’ll be using the Falcon-7B model, which is available from the Baseten Model Library, but feel free to choose any model in the library that you’re interested in. Each of these models has been configured with sensible defaults.
Learn more about deploying open source models from the Baseten Model Library in this blog post from Emmiliese!
After clicking on the model that you’d like to deploy, you’ll see more information about that specific model, along with a green button in the top right corner that reads “Deploy on Baseten.”
Screenshot of the Falcon-7B model page in the Baseten Model Library
Clicking on “Deploy on Baseten” will deploy the model in its published state on Baseten and bring you directly to the model page, where we have access to a tremendous amount of information about our model.
Zooming in on the model resources, we see that we’re running this model on an NVIDIA A10G with scale to zero enabled (as evidenced by the moon icon for minimum replicas), and that when the model receives traffic, it will use at most one replica. Each of these settings is customizable, so let’s get in there and see what we can do!
When we click “Update resources” from the model page, we’re given access to all of the options for customizing how our model runs.
When we deployed Falcon-7B from the Baseten Model Library, it was automatically deployed on an A10G instance, which it runs well on. Although we set default instances for the models in the Baseten Model Library, you have full control over choosing the right instance for your model, including both GPU and CPU-only instances.
Autoscaling, as mentioned above, lets your model automatically spawn or terminate replicas in response to the amount of incoming traffic. On Baseten you can choose both the minimum and maximum number of replicas that your model uses, ensuring that you only pay for the resources that you consume. So if you’ve selected eight maximum replicas, but your model is only receiving an amount of traffic that requires one replica, you’ll only be charged for the use of one replica. In other words, you can think of the maximum number of replicas as a spending cap: you’re setting the upper limit on the number of replicas you’d like your model to utilize in response to increased traffic.
Because of how our infrastructure is configured, you don’t need to worry about losing traffic if your maximum number of replicas is set too low and you suddenly have a big burst in traffic. All excess traffic is placed in a queue and processed as the traffic load decreases or as you increase the maximum number of replicas.
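To build intuition for this queueing behavior, here’s a minimal toy sketch (not Baseten’s actual implementation) in which requests beyond the pool’s total capacity wait in a queue rather than being dropped:

```python
from collections import deque

class ReplicaPool:
    """Toy model of a replica pool: excess requests queue instead of being lost."""

    def __init__(self, max_replicas, concurrency_target):
        # Total in-flight capacity across all replicas.
        self.capacity = max_replicas * concurrency_target
        self.in_flight = 0
        self.queue = deque()

    def submit(self, request_id):
        # Accept the request immediately if any replica has spare concurrency.
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return "processing"
        # Otherwise hold it in the queue until capacity frees up.
        self.queue.append(request_id)
        return "queued"

    def finish_one(self):
        # A request completed; promote the next queued request, if any.
        self.in_flight -= 1
        if self.queue:
            self.in_flight += 1
            return self.queue.popleft()
        return None
```

With one replica and a concurrency target of two, a third simultaneous request is queued, then picked up as soon as an earlier one finishes.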
We love a good autoscaling challenge! Learn more about how we served four million Riffusion requests in two days.
By default, when you deploy a new model, the minimum number of replicas is set to zero and the maximum is set to one. When your minimum number of replicas is set to zero, scale to zero is enabled, which means your model will be put to sleep after a period of inactivity. You can turn scale to zero off by setting your minimum number of replicas to one or higher, but be aware that by doing so your model will always be consuming some compute, even if it’s not receiving any requests.
In addition to setting the minimum and maximum number of replicas for your model, you have three additional controls for how your published model is run: scaling delay, additional scale down delay, and concurrency target.
Scaling delay is the time that needs to pass before replicas are scaled up or down in response to a change in traffic. This helps ensure that the autoscaler doesn’t overreact to small changes in traffic, and waits the specified time to see if the changes are stable. The default setting is the maximum amount of time, set at 3,600 seconds (one hour), but you can set this as low as 10 seconds.
Additional scale down delay refers to an additional amount of time that must pass before scaling down a replica, once the aforementioned scaling delay has been accounted for. Because scaling back up is relatively time-consuming, it often helps to confirm that a drop in traffic is sustained by adding some more time to the scaling decision window. This helps maintain replica capacity when request traffic fluctuates.
Concurrency target refers to the number of requests that are allowed to be made on one replica at any one time. If the value you’ve selected is exceeded, additional replicas will be scaled up to accommodate the requests.
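Putting the replica limits and concurrency target together, the core scaling decision can be sketched as a one-line formula (a simplification, not Baseten’s actual autoscaler): run enough replicas that no replica exceeds the concurrency target, clamped between your minimum and maximum:

```python
import math

def desired_replicas(in_flight_requests, concurrency_target,
                     min_replicas, max_replicas):
    """Replicas needed so no replica handles more than concurrency_target
    requests at once, clamped to the configured min/max bounds."""
    needed = math.ceil(in_flight_requests / concurrency_target)
    return max(min_replicas, min(needed, max_replicas))
```

For example, with a concurrency target of 4 and a maximum of 8 replicas, 25 concurrent requests call for 7 replicas, 100 concurrent requests are capped at 8, and zero traffic (with a minimum of zero) scales all the way down.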
To demonstrate how all of the Baseten autoscaling features work in concert with one another, we ran a load test where we sent a series of sequential requests to our model, and have shared the graphs below. We set the scaling delay to one minute so that our model would quickly scale down, allowing us to show off all of our autoscaling features in action.
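A sequential load test like this one can be sketched in a few lines. The model-calling function here is a stand-in: in practice it would POST to your model’s endpoint with your API key, both of which are specific to your deployment:

```python
import time

def run_load_test(call_model, num_requests):
    """Send requests one after another, recording per-request latency in seconds."""
    latencies = []
    for i in range(num_requests):
        start = time.perf_counter()
        call_model(i)
        latencies.append(time.perf_counter() - start)
    return latencies
```

You might invoke it as `run_load_test(lambda i: requests.post(model_url, headers=auth_headers, json=payload), 500)`, where `model_url`, `auth_headers`, and `payload` are placeholders for your own deployment’s details. Plotting the returned latencies makes scale-up events and cold starts easy to spot as latency spikes.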
Cold starts, which refer to bringing a model up from zero to (at least) one replica after the model has been scaled to zero, are enabled on all Baseten accounts. Going from this zero to one state is particularly difficult in the machine learning world because of the sheer size of the thing that we’re trying to scale: if you’re working with large open source models, the model weights are massive. And because GPU compute is expensive, we don’t want idle compute lying around. Once our model scales to zero and is no longer using computing resources, it’s important for it to respond quickly to any new traffic with minimal delays.
To get an idea of just how fast our cold starts are, for Stable Diffusion on an A10G, we reliably see cold start times of 15 seconds, from zero to ready for inference.
We’ve covered the autoscaling features available to you on Baseten, and done a walkthrough with the Falcon-7B model. Now it’s time to start building! Take any of the models from the Baseten Model Library, or package your own model using Truss, our open source model packaging and deployment framework, and deploy it to Baseten. From there you can invoke your model through either the Baseten client or curl. And be sure to tag us on Twitter–we’d love to see what you build!
If you have any questions or run into issues while working with Baseten, please reach out to us at firstname.lastname@example.org.
A special thank you to my colleagues Amir, Jo, Matt, and Phil for taking the time to share their infrastructure expertise with me!