How to choose the right instance size for your ML models

When you deploy a packaged machine learning model to a cloud service like AWS, you have to choose an instance type for the server that will run your model. Larger instances are more powerful but also more expensive, and the naming and sizing of instances get complicated quickly. Fortunately, you can follow a few simple heuristics to select an instance size that can handle your model while minimizing compute cost.

Instance sizing includes a few different factors:

vCPU cores: The number of CPU cores for running your code
Memory: The amount of RAM for storing data that operations running on the CPU need to access quickly
GPU: The number and type of GPUs attached to the instance
Video memory: The amount of VRAM for storing data that operations running on the GPU need to access quickly

Instance sizing can include other factors like core type and persistent storage, but you usually only need to think about the four factors listed above. With these factors in mind, you only need to make two key decisions to select the right instance type for your model.

First key decision: CPU or GPU

Some models can be served on a GPU. Large foundation models like Stable Diffusion and Whisper typically run on GPUs. When such a model is invoked, the GPU processes the request with hardware that far outperforms CPUs at parallel computation. Other models, like most regression models, only need a standard CPU.

If your model can run on a GPU and invocation speed matters for your use case, you’ll want to select an instance with an attached GPU. Otherwise, stick with the less expensive CPU instances.

Unless your model or model serving code is specifically optimized to take advantage of parallelization from multiple CPU cores, vCPU count doesn’t matter. Instead, you’ll determine instance size based on your memory requirements.
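The first decision reduces to a two-condition check. Here is a minimal Python sketch of it; the function name `choose_compute` and its two boolean inputs are hypothetical, chosen only to mirror the rule described above.

```python
def choose_compute(can_run_on_gpu: bool, speed_matters: bool) -> str:
    """Return "gpu" only when the model supports a GPU AND invocation
    speed matters; otherwise fall back to the cheaper CPU option."""
    if can_run_on_gpu and speed_matters:
        return "gpu"
    return "cpu"


# A model that can use a GPU and serves latency-sensitive requests.
print(choose_compute(can_run_on_gpu=True, speed_matters=True))

# A regression model that only needs a standard CPU.
print(choose_compute(can_run_on_gpu=False, speed_matters=True))
```

Either answer also tells you which memory figure to size against next: VRAM for a GPU instance, RAM for a CPU instance.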

Second key decision: Memory size

Once you’ve determined the type of compute unit your model will run on, you need to select an amount of memory associated with that compute unit. This will effectively determine your instance size. So if your model uses a GPU, base your instance size on GPU memory, and if it doesn’t, base it on standard/CPU memory.

To select a memory size, check the size of your model weights files and all other files that your model needs to have loaded into memory, then select the next size up in memory.
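One way to check that size is to add up the files on disk. The sketch below assumes your weights and supporting files all live under a single directory; the helper name `total_size_gb` and the example path are hypothetical. Treat the result as an estimate, since a model's in-memory footprint can differ from its size on disk.

```python
from pathlib import Path


def total_size_gb(model_dir: str) -> float:
    """Sum the sizes of all regular files under model_dir, in GB (10**9 bytes)."""
    total_bytes = sum(
        path.stat().st_size
        for path in Path(model_dir).rglob("*")  # walk subdirectories too
        if path.is_file()
    )
    return total_bytes / 1e9


# Example usage (hypothetical path):
# print(f"{total_size_gb('./my_model'):.2f} GB")
```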

For example, if your model is 6GB and your options for memory size include 4GB, 8GB, and 12GB:

4GB is not enough, and will lead to out-of-memory errors.
12GB is unnecessary headroom, leaving you paying more money for the same results.
8GB is just right. You’d select the instance that has 8GB of memory.

That said, if your model is just barely smaller than the closest memory size, like a 7.75GB model when the closest option is 8GB, it might be worth springing for the larger instance to get enough headroom to avoid any issues.
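The "next size up, with some headroom" rule can be sketched in a few lines of Python. The function name `pick_memory_gb` and the 0.5GB default headroom are assumptions made for illustration, not recommended values; tune the headroom to your own model and serving code.

```python
from typing import Iterable, Optional


def pick_memory_gb(
    model_gb: float,
    options_gb: Iterable[float],
    min_headroom_gb: float = 0.5,  # assumed threshold, for illustration only
) -> Optional[float]:
    """Return the smallest memory option that fits the model with at least
    min_headroom_gb to spare, or None if no option is large enough."""
    for size in sorted(options_gb):
        if size - model_gb >= min_headroom_gb:
            return size
    return None


print(pick_memory_gb(6, [4, 8, 12]))     # 8: the next size up
print(pick_memory_gb(7.75, [4, 8, 12]))  # 12: 8GB leaves too little headroom
```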

You just need an instance that’s a bit larger than your model. If the instance is too small, you’ll run into out-of-memory errors, but if your instance is oversized, you’re paying more for no benefit.

These two key decisions will help you select the appropriate instance type for most model deployments. Keep in mind that instance types can be inflexible. For example, you might end up selecting an instance size with more vCPUs than you need because it is the only one that has enough memory to load your model weights. But that's okay: what matters is selecting the cheapest available instance that can satisfactorily run your model.
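Putting the pieces together, "cheapest instance that fits" is a filter followed by a minimum. The sketch below uses a made-up catalog: the instance names, specs, and hourly prices are hypothetical placeholders, not real offerings from any provider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    name: str
    vcpus: int
    memory_gb: float
    hourly_usd: float


# Hypothetical catalog; names, specs, and prices are invented for illustration.
CATALOG = [
    Instance("example-a", vcpus=2, memory_gb=4, hourly_usd=0.10),
    Instance("example-b", vcpus=4, memory_gb=8, hourly_usd=0.20),
    Instance("example-c", vcpus=8, memory_gb=16, hourly_usd=0.40),
]


def cheapest_fit(catalog: List[Instance], required_gb: float) -> Optional[Instance]:
    """Return the lowest-priced instance with at least required_gb of memory,
    or None if nothing in the catalog is large enough."""
    fits = [inst for inst in catalog if inst.memory_gb >= required_gb]
    return min(fits, key=lambda inst: inst.hourly_usd) if fits else None


choice = cheapest_fit(CATALOG, required_gb=6.0)
print(choice.name if choice else "no instance is large enough")
```

In this made-up catalog, a 6GB requirement lands on `example-b`, which comes with 4 vCPUs whether or not the model can use them. The selection is driven by memory and price, with vCPU count along for the ride.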

If you’re looking for a platform that offers model deployment with simple instance sizing (and a whole lot more), try out Baseten today.
