Deploying a model to an appropriately sized instance and invoking it in production is a big accomplishment. But if your ML-powered application needs to support high or variable traffic while maintaining a quick response time, there is more work to be done. Fortunately, you don’t need to be an infrastructure engineer to understand the basic mechanics of horizontal scaling.
Horizontal scaling via replicas (Kubernetes terminology for additional copies of your service) with load balancing is an important technique for handling high traffic to an ML model. Let’s examine three tips for understanding how to properly replicate your instances to save users time without wasting your money.
Handling more traffic with replicas
Horizontal scaling with replicas is the sort of solution that’s so obvious it’s surprising that it works. Imagine a checkout lane at a grocery store. There are various things that you can do to increase the speed of the cashiers, but when the line gets long, the simplest way to shorten it is by opening up a second lane. Similarly, if the line of requests against your ML model grows too long, causing unacceptable latency, spinning up a second instance of your server will cut the number of requests waiting on each instance roughly in half, assuming requests are spread evenly between the two.
Tip 1: Use replicas when your concern is processing a large number of requests. Adding replicas won’t help small servers run large models, because every replica is the same size as the first.
Autoscaling for variable demand
Cost scales linearly with the number of replicas you have running. If your model has inconsistent traffic, you want more replicas during spikes to handle the load, but you want to get rid of those replicas in quieter times to reduce infrastructure spend. This process is called autoscaling.
Tip 2: Your minimum replica count should almost always be one.
Unless you have a clear reason for keeping more replicas running (such as a latency ceiling you must meet even when traffic spikes) and plenty of extra budget to cover the cost, let your system scale down and avoid unnecessary spend during low-traffic periods.
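To make the scaling decision concrete, here is a minimal sketch of how an autoscaler might pick a replica count from current load. The function name, its parameters, and the default limits are hypothetical and chosen for illustration; this is not the API of any particular autoscaler.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    target_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 5,
) -> int:
    """Return how many replicas to run for the current load.

    All names and defaults are illustrative assumptions: the idea is to
    run enough replicas that each handles at most `target_per_replica`
    requests, without leaving the configured minimum and maximum.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Replicas needed so no replica exceeds its target load.
    needed = math.ceil(in_flight_requests / target_per_replica)
    # Clamp to the configured bounds: the minimum keeps the service up
    # in quiet periods, the maximum caps infrastructure spend.
    return max(min_replicas, min(max_replicas, needed))
```

With a minimum of one, the sketch scales down to a single replica when there is no traffic and stops adding replicas once it reaches the maximum, no matter how large the spike.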
As your service adds and discards replicas, it will need to route incoming requests to replicas that have capacity to process them. This requires a load balancer. One cool thing about load balancers is that they can be software or hardware, though anyone deploying to a public cloud or hosted platform will only encounter them as software. And many autoscaling solutions offer built-in load balancing, though on AWS you may need to configure your own depending on your use case.
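As a rough illustration of what routing to "replicas that have capacity" can mean, the sketch below sends each request to the replica with the fewest requests in flight. The class and method names are hypothetical, and real load balancers use a range of strategies; this is one simple approach, not a description of any specific product.

```python
class LeastBusyBalancer:
    """Toy load balancer that tracks in-flight requests per replica."""

    def __init__(self, replica_names):
        # Map each replica name to its number of in-flight requests.
        self.in_flight = {name: 0 for name in replica_names}

    def add_replica(self, name):
        # Called when autoscaling brings a new replica online.
        self.in_flight.setdefault(name, 0)

    def remove_replica(self, name):
        # Called when autoscaling discards a replica.
        self.in_flight.pop(name, None)

    def acquire(self):
        """Pick the least busy replica and count one more request on it."""
        if not self.in_flight:
            raise RuntimeError("no replicas available")
        # Ties go to the replica that was registered first.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name):
        """Mark one request on the given replica as finished."""
        if self.in_flight.get(name, 0) > 0:
            self.in_flight[name] -= 1
```

A caller would wrap each model invocation in `acquire()` and `release()`, so a freshly added replica starts receiving traffic as soon as it is registered.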
Limits of autoscaling
Autoscaling is a great solution for variable demand, but it has its limitations. If traffic scales up faster than you can add replicas, model serving will be slow until the system has caught up. And with a large number of replicas, even sophisticated load balancers can leave some replicas busier than others. Plus, adding more replicas multiplies your infrastructure costs.
Tip 3: At some point, the marginal cost of adding another replica outweighs the marginal latency decrease.
Your maximum replica count is most likely constrained by budget. Going from 1 to 2 replicas cuts wait times by about 50%, but going from 4 to 5 replicas cuts them by only about 20%.
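The arithmetic behind those figures can be sketched in a few lines. This assumes a simplified model in which requests are spread evenly and wait time is proportional to each replica’s share of the load; the function names are hypothetical.

```python
from fractions import Fraction


def relative_saving(replicas: int) -> Fraction:
    """Fractional drop in per-replica load when adding one more replica.

    Assumes load is split evenly, so each of n replicas carries 1/n of it.
    """
    before = Fraction(1, replicas)
    after = Fraction(1, replicas + 1)
    return (before - after) / before


def absolute_saving(replicas: int) -> Fraction:
    """The same drop, measured against the single-replica baseline."""
    return Fraction(1, replicas) - Fraction(1, replicas + 1)


if __name__ == "__main__":
    for n in range(1, 5):
        print(
            f"{n} -> {n + 1} replicas: "
            f"{float(relative_saving(n)):.0%} less than before, "
            f"{float(absolute_saving(n)):.0%} of the original wait"
        )
```

Under this model, every added replica costs the same, but the fifth one removes only a twentieth of the original single-replica wait, whereas the second one removed half of it.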
In addition to relying on autoscaling, interventions like response caching and model optimization can improve performance and reduce cost. These are outside the scope of this article, but it’s worth remembering that you shouldn’t rely on autoscaling alone to handle major traffic.
Baseten offers built-in autoscaling to allow data scientists to deploy models to large user bases without worrying about infrastructure. Here’s a writeup of how we powered over 4 million requests when Riffusion was #1 on Hacker News.