When you deploy a packaged machine learning model to a cloud service like AWS, you have to choose an instance type for the server that will run your model. Larger instances are more powerful but also more expensive, and the naming and sizing of instances get complicated quickly. Fortunately, you can follow a few simple heuristics to select an instance size that can handle your model while minimizing compute cost.
Instance sizing includes a few different factors:
- vCPU cores: The number of CPU cores for running your code
- Memory: The amount of RAM for storing data that operations running on the CPU need to access quickly
- GPU: The number and type of GPUs attached to the instance
- Video memory: The amount of VRAM for storing data that operations running on the GPU need to access quickly
Instance sizing can include other factors like core type and persistent storage, but you usually only need to think about the four factors listed above. With these factors in mind, you only need to make two key decisions to select the right instance type for your model.
First key decision: CPU or GPU
Some models can be served on a GPU. Large foundation models like Stable Diffusion and Whisper run on GPUs. When such a model is invoked, the GPU processes the invocation with hardware that far outperforms CPUs on highly parallel workloads. Other models, like most regression models, only need a standard CPU.
If your model can run on a GPU and invocation speed matters for your use case, you’ll want to select an instance with an attached GPU. Otherwise, stick with the less expensive CPU instances.
Unless your model or model serving code is specifically optimized to take advantage of parallelization from multiple CPU cores, vCPU count doesn’t matter. Instead, you’ll determine instance size based on your memory requirements.
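As a rough sketch, the first decision can be written as a two-input rule. The function name `choose_compute` and its parameters are made up for illustration; the logic only restates the heuristic above and is not part of any platform's API.

```python
def choose_compute(runs_on_gpu: bool, speed_matters: bool) -> str:
    """Return "gpu" or "cpu" following the rule of thumb in this section."""
    # A GPU instance only pays off when the model can use it AND
    # invocation speed matters for the use case.
    if runs_on_gpu and speed_matters:
        return "gpu"
    # In every other case, stay on the less expensive CPU instances.
    return "cpu"
```

For example, a model that can run on a GPU but serves an offline batch job where latency is irrelevant would still come back as `"cpu"` under this rule.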
Second key decision: Memory size
Once you’ve determined the type of compute unit your model will run on, you need to select an amount of memory associated with that compute unit. This will effectively determine your instance size. So if your model uses a GPU, base your instance size on GPU memory, and if it doesn’t, base it on standard/CPU memory.
To select a memory size, add up the size of your model weights files and all other files that your model needs to have loaded into memory, then select the smallest available memory size that is larger than that total.
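One way to get that total is to sum the file sizes in the directory that holds your model. The sketch below assumes all weights and supporting files live under a single directory and treats "GB" as 1024³ bytes; the function name `files_size_gb` is hypothetical. On-disk size is only an approximation of the in-memory footprint, so treat the result as an estimate.

```python
from pathlib import Path

# Assumption: count a "GB" as 1024**3 bytes.
BYTES_PER_GB = 1024 ** 3


def files_size_gb(model_dir: str) -> float:
    """Total size of every regular file under model_dir, in GB."""
    root = Path(model_dir)
    # rglob("*") walks subdirectories too; skip directories themselves.
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / BYTES_PER_GB
```

You would call it with the path to your own model directory, for example `files_size_gb("my_model/")`, where `my_model/` is a placeholder.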
For example, if your model is 6GB and your options for memory size include 4GB, 8GB, and 12GB:
- 4GB is not enough, and will lead to out-of-memory errors.
- 12GB is unnecessary headroom, leaving you paying more money for the same results.
- 8GB is just right. You’d select the instance that has 8GB of memory.
That said, if your model is just barely smaller than the closest memory size, like a 7.75GB model on an 8GB instance, it might be worth springing for the larger instance to get enough headroom to avoid out-of-memory errors.
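The selection rule, including the headroom caveat, can be sketched as a small function. The 10% headroom default is an assumption chosen for illustration, not a figure from this article, and `pick_memory_gb` is a hypothetical name; tune the margin to what your model actually needs at runtime.

```python
from typing import Sequence


def pick_memory_gb(
    model_gb: float,
    options_gb: Sequence[float],
    headroom: float = 0.10,  # assumed safety margin, not a fixed rule
) -> float:
    """Return the smallest memory option that fits the model plus headroom."""
    required = model_gb * (1 + headroom)
    # Walk the options from smallest to largest and keep those that fit.
    candidates = [size for size in sorted(options_gb) if size >= required]
    if not candidates:
        raise ValueError(f"No option fits {required:.2f}GB")
    return candidates[0]


print(pick_memory_gb(6, [4, 8, 12]))     # 8
print(pick_memory_gb(7.75, [4, 8, 12]))  # 12
```

With these assumptions, the 6GB model lands on the 8GB option, while the 7.75GB model is pushed up to 12GB because 8GB leaves too little margin.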
These two key decisions will help you select the appropriate instance type for most model deployments. Instance types can be inflexible. For example, you might end up selecting an instance size with more vCPUs than you need because it is the only one that has enough memory to load your model weights. But that’s okay: what matters is selecting the cheapest available instance that can satisfactorily run your model.
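Putting both decisions together, picking an instance amounts to filtering a catalog by the relevant kind of memory and taking the cheapest match. The catalog below is entirely made up for illustration: the instance names, specs, and prices are not real offerings from AWS or any other provider, and `cheapest_fit` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Instance:
    name: str
    vcpus: int
    memory_gb: float
    gpu_memory_gb: float  # 0 for CPU-only instances
    hourly_usd: float


# Made-up catalog for illustration only; not real instance types or prices.
CATALOG = [
    Instance("cpu-small", 1, 2, 0, 0.05),
    Instance("cpu-medium", 2, 8, 0, 0.10),
    Instance("cpu-large", 4, 16, 0, 0.20),
    Instance("gpu-small", 4, 16, 16, 0.60),
    Instance("gpu-large", 8, 32, 24, 1.20),
]


def cheapest_fit(catalog: Sequence[Instance], needs_gpu: bool, required_gb: float) -> Instance:
    """Return the cheapest instance whose relevant memory covers required_gb."""

    def fits(inst: Instance) -> bool:
        # GPU models are sized on video memory, CPU models on standard memory.
        if needs_gpu:
            return inst.gpu_memory_gb >= required_gb
        return inst.memory_gb >= required_gb

    candidates = [inst for inst in catalog if fits(inst)]
    if not candidates:
        raise ValueError("No instance in the catalog fits the model")
    return min(candidates, key=lambda inst: inst.hourly_usd)


print(cheapest_fit(CATALOG, needs_gpu=False, required_gb=12).name)  # cpu-large
```

In this toy catalog, a 12GB CPU model ends up on `cpu-large` with 4 vCPUs even if it only uses one, because that is the cheapest entry with enough memory.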
If you’re looking for a platform that offers model deployment with simple instance sizing (and a whole lot more), try out Baseten today.