Slow dev loops break flow state and make for a frustrating experience.
When a web developer using any modern framework is coding locally, they don’t have to wait for a server to restart, container to rebuild, or dependency chain to resolve before seeing the result of their change. Instead, the update is nearly instant. Web developers have come to expect this live reload experience from their toolchains, and it is time that data scientists and ML engineers expect the same.
But for data scientists, slow dev loops make all but the most essential deployment workflows too expensive and time-consuming to even consider. To fix this, Baseten is introducing draft models.
When you make a code change to a draft model, the model server checks if it can hot-swap the new code in place of the code that is currently running. For example, if you update your pre-processing function to parse input differently, that new function can be swapped in and run immediately, without shutting down and rebuilding your model serving environment.
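Conceptually, a hot swap rebinds individual functions on the running model server rather than rebuilding the whole environment. Here's an illustrative sketch of the idea (a toy stand-in, not Baseten's internal implementation):

```python
import json

class ModelServer:
    """Toy stand-in for a running model server (illustrative only)."""

    def __init__(self, preprocess, predict):
        self.preprocess = preprocess
        self.predict = predict

    def handle_request(self, raw_body: str):
        # Every request flows through the current pre-processing function
        return self.predict(self.preprocess(raw_body))

# Original pre-processing: treat the body as a single bare string
server = ModelServer(
    preprocess=lambda body: [body],
    predict=lambda inputs: {"n_inputs": len(inputs)},
)
server.handle_request("hello")  # {"n_inputs": 1}

# "Live reload": swap in a new parser that accepts a JSON list,
# without tearing the server down or losing its state
def preprocess_v2(body: str):
    return json.loads(body)

server.preprocess = preprocess_v2
server.handle_request('["a", "b", "c"]')  # {"n_inputs": 3}
```

The key property is that the swap happens in place: the server object (and, in the real system, the loaded model weights) stays warm while only the changed function is replaced.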
This live reload workflow — redeploying and testing code in real time — is exactly the same superpower that web developers have enjoyed for decades. Live reload makes common deployment tasks 100X faster: from waiting five minutes or more for your container to rebuild to just about three seconds for everything to be updated. Live reload means you can test your model in the context of your entire application and rapidly iterate until you’re happy with the system end-to-end.
One hundred times faster
Spinning up a model server from nothing takes time. First, resources must be allocated to a server. Then a Docker image is built and a proper Python environment is installed. Once everything is ready, the model is loaded onto the server, and the server is configured to accept requests.
This whole process varies in duration based on the complexity of the environment and the size of the model, but it takes a few minutes. Let’s say five. Deploying a model in five minutes as a last step in a workflow is not that bad; the problem is that this five-minute deployment happens over and over again during common dev loops.
Baseten has solved this choke point for common model deployment tasks by enabling live reload on draft models. With draft models, the dev loop for testing an update to your model code gets 100X faster — from about five minutes to about three seconds.
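The 100X figure follows directly from the timings, and it compounds over a working session (the iteration count below is a hypothetical illustration):

```python
full_rebuild_s = 5 * 60  # ~5 minutes per full container rebuild
live_reload_s = 3        # ~3 seconds per hot swap

speedup = full_rebuild_s / live_reload_s
print(speedup)  # 100.0

# Over a session of, say, 25 small code changes (hypothetical count):
iterations = 25
saved_minutes = iterations * (full_rebuild_s - live_reload_s) / 60
print(round(saved_minutes))  # 124 minutes of waiting eliminated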
Slow dev loops break flow state
I’ll admit it … every once in a while a few minutes of compile time is nice. Waiting for my code to build, I can stretch, grab a snack, or respond to a Slack message. But when I’m trying to iterate rapidly during development, waiting minutes for each change is just brutal. And in the course of ordinary dev work, these gaps make it hard to find and stay in a flow state.
With this pain in mind, we set out to radically shorten the dev loop for writing model serving code. We decided that data scientists and ML engineers need live reload for better workflows. Let’s take a look at where and how this affects the model creation process.
How it works
Creating draft models
Deploying a draft model just requires a single flag in the baseten.deploy() command. Here’s a simple example:
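A minimal sketch of that call is below. The model object and the draft flag's exact name (`is_draft` here) are assumptions for illustration; check the Baseten client documentation for the precise signature in your version:

```python
import baseten

# Authenticate with your workspace API key
baseten.login("YOUR_API_KEY")

# `my_model` is your trained model object (e.g. a scikit-learn model)
baseten.deploy(
    my_model,
    model_name="my-draft-model",
    is_draft=True,  # deploy as a draft; flag name assumed, see docs
)
```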
Your model will deploy as a draft, which will take a few minutes, and then you’ll have live reload on subsequent deployments. To experience the type of workflows this enables, check out our demo notebook on GitHub or Google Colab.
Testing in a production-like environment
Deploying a model as a draft doesn’t interfere with existing versions; a new version is created only when you’re satisfied with your model. Until then, your primary version stays the same, allowing you to test your draft model without changing production systems.
Draft models operate much like fully deployed models. You can invoke them by version ID, call them in worklets, and test them with draft applications. While they don’t have the same uptime and scaling features as a full deployment, a draft model is a nearly identical substitute for integration testing and application development. And what’s more, draft models don’t count against your workspace billing limits; experiment freely and only deploy what works!
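For example, invoking a draft model from client code looks the same as invoking any other version. The helper name below follows the pattern of the Baseten Python client, but treat it as an assumption and confirm against the client docs:

```python
import baseten

baseten.login("YOUR_API_KEY")

# Look up the draft by its version ID, exactly as you would
# a fully deployed version (helper name assumed)
model = baseten.deployed_model_version_id("VERSION_ID")

# Call it with the same input shape your model expects
prediction = model.predict([[1.0, 2.0, 3.0]])
```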
Accelerate your dev loop
If you’re not yet using Baseten to deploy your models, sign up today to accelerate your dev loop with draft models and the entire Baseten platform.