Why GPU utilization matters for model inference

GPU utilization is the measure of how much of a GPU’s resources are in use at any given time during a workload. When running ML models, we want to maximize GPU utilization to decrease the cost of serving high-traffic model endpoints. If you get more performance from each GPU, you’re able to serve the same traffic with fewer GPUs, saving on model hosting costs.

Imagine you’re at the office with your entire team — let’s say 12 people. You all need to get to an event across town, so you book some Ubers. If you pack in 4 people per car, you only need to call 3 cars. But if only 2 or 3 people get in each car, you’ll need more — potentially spending twice as much.

Just like this rideshare metaphor only makes sense with a large group of people, GPU utilization becomes important with higher-traffic workloads. When you’re serving so many requests to your model that you have to spin up additional instances to handle the load, you want to make sure that each instance that you’re paying for is doing as much work as possible.

How to measure GPU utilization

There are three main stats to consider for GPU utilization:

Compute usage: what percentage of the time is a GPU running a kernel vs sitting idle?
Memory usage: how much of the GPU’s VRAM is in use during inference?
Memory bandwidth usage: how much of the available bandwidth is being used to send data to the compute cores?
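
For a quick look at the first two stats on an NVIDIA GPU, one option is the nvidia-smi command-line tool. The sketch below assumes an NVIDIA driver with nvidia-smi on the PATH. The helper names parse_gpu_stats and query_gpu_stats are hypothetical, not part of any library. It leaves out memory bandwidth usage, which typically calls for a profiler.

```python
import subprocess

# Fields requested from nvidia-smi; the order must match the parsing below.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_line):
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    compute_pct, used_mib, total_mib = (
        float(field.strip()) for field in csv_line.split(",")
    )
    return {
        "compute_pct": compute_pct,      # % of time a kernel was running
        "vram_used_mib": used_mib,
        "vram_total_mib": total_mib,
        "vram_pct": 100.0 * used_mib / total_mib,
    }


def query_gpu_stats():
    """Return one stats dict per visible GPU (needs nvidia-smi installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_stats(line)
            for line in result.stdout.splitlines() if line.strip()]


# Parsing a made-up sample line, so this runs without a GPU.
sample = parse_gpu_stats("87, 61440, 81920")
print(sample["compute_pct"], sample["vram_pct"])
```

Sampling query_gpu_stats() once per second during a load test gives a rough utilization trace. A single reading says little, because utilization swings with traffic.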

When we talk about improving GPU utilization for LLMs, we almost always mean increasing compute usage. Memory bandwidth is generally the bottleneck on inference speed, so compute capacity is often left on the table. Overall VRAM capacity caps the model size and the number of concurrent prompts, but it’s generally not the usage number we’re trying to increase.

Some parts of running a model are compute bound, meaning that the bottleneck for performance is how fast the GPU can calculate values. One compute-bound process is the prefill phase of an LLM, where the model is processing the full prompt to create the first token of its response.

But most of LLM inference is memory bound. After the first token, generation is limited by the bandwidth of the GPU’s VRAM, which determines how quickly tokens (or images, transcriptions, audio files, etc.) can be generated.

Given that most LLM inference is memory bound, we look for strategies that run more calculations per byte of memory accessed, which raises compute utilization.
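
One way to make "calculations per byte" concrete is to compare a workload's operations per byte with the ratio the hardware can sustain. The sketch below is a simplified model. It counts only weight reads and matrix math and ignores attention and the KV cache. The peak compute and bandwidth figures are illustrative round numbers, not the specs of any particular GPU.

```python
def hardware_ops_per_byte(peak_flops, bandwidth_bytes_per_s):
    """FLOPs the GPU can perform in the time it takes to move one byte."""
    return peak_flops / bandwidth_bytes_per_s


def workload_ops_per_byte(tokens_per_weight_read, bytes_per_param=2.0):
    """Approximate FLOPs per byte of weights read.

    Each parameter costs about 2 FLOPs (multiply + add) per token processed,
    and is read from VRAM once for all tokens processed together.
    """
    return 2.0 * tokens_per_weight_read / bytes_per_param


def bottleneck(workload_ratio, hardware_ratio):
    """Below the hardware ratio, bandwidth runs out before compute does."""
    return "memory bound" if workload_ratio < hardware_ratio else "compute bound"


# Illustrative assumptions: 300 TFLOPS of compute, 2 TB/s of bandwidth.
hw = hardware_ops_per_byte(300e12, 2e12)

prefill = workload_ops_per_byte(tokens_per_weight_read=1024)  # whole prompt at once
decode = workload_ops_per_byte(tokens_per_weight_read=1)      # one token per step

print("prefill:", bottleneck(prefill, hw))
print("decode: ", bottleneck(decode, hw))
```

Under these assumptions, prefill lands on the compute-bound side and single-sequence decoding on the memory-bound side, matching the description above.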

How to increase GPU utilization

Generally, you increase GPU utilization by increasing batch sizes during inference. The batch size determines how many user inputs the LLM processes concurrently. A larger batch size lets a model use more compute resources even when memory bound: every model weight read from VRAM is applied to more sequences at once, increasing the amount of compute you can use per byte of bandwidth.
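
A rough model shows why batching helps. In the sketch below, a decode step takes as long as the slower of two things: reading all the weights once, or doing the math for every sequence in the batch. The model size and hardware figures are hypothetical, and the model ignores the KV cache and kernel overheads.

```python
def decode_step(batch_size, n_params, peak_flops, bandwidth_bytes_per_s,
                bytes_per_param=2.0):
    """Estimate one decode step under a simple max(memory, compute) model."""
    # Weights are read once per step, regardless of batch size.
    memory_time = n_params * bytes_per_param / bandwidth_bytes_per_s
    # About 2 FLOPs per parameter for each sequence in the batch.
    compute_time = 2.0 * n_params * batch_size / peak_flops
    step_time = max(memory_time, compute_time)
    return {
        "step_time_s": step_time,
        "compute_utilization": compute_time / step_time,
    }


# Hypothetical setup: 7B parameters in 16-bit, 300 TFLOPS, 2 TB/s.
for batch_size in (1, 8, 64, 256):
    est = decode_step(batch_size, n_params=7e9,
                      peak_flops=300e12, bandwidth_bytes_per_s=2e12)
    print(f"batch {batch_size:>3}: "
          f"compute utilization {est['compute_utilization']:.1%}")
```

In this model, compute utilization grows roughly in proportion to batch size until the step becomes compute bound, after which larger batches only make each step longer.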

Increasing batch size increases throughput, which is the measure of how many requests a GPU instance can handle per second. However, increasing throughput generally makes latency worse, meaning users have to wait longer to get model output. It’s important to manage this tradeoff when trying to maximize utilization.
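
The tradeoff can be sketched with a toy model in which each decode step has a fixed cost plus a small cost per sequence in the batch. Both constants below are hypothetical, chosen only to show the shape of the curve. The helper picks the largest batch size that still fits a latency budget.

```python
# Hypothetical constants for a toy decode-step model, in milliseconds.
WEIGHT_READ_MS = 7.0    # fixed cost: read the model weights once per step
PER_SEQUENCE_MS = 0.1   # assumed extra cost per sequence in the batch


def step_latency_ms(batch_size):
    """Time each user waits for their next token."""
    return WEIGHT_READ_MS + PER_SEQUENCE_MS * batch_size


def throughput_tokens_per_s(batch_size):
    """Tokens produced per second across the whole batch."""
    return batch_size * 1000.0 / step_latency_ms(batch_size)


def largest_batch_within_budget(latency_budget_ms,
                                candidates=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Largest candidate batch size whose step latency fits the budget."""
    allowed = [b for b in candidates if step_latency_ms(b) <= latency_budget_ms]
    return max(allowed) if allowed else None


for batch_size in (1, 8, 32, 128):
    print(f"batch {batch_size:>3}: "
          f"{step_latency_ms(batch_size):5.1f} ms/token, "
          f"{throughput_tokens_per_s(batch_size):7.0f} tokens/s")

print("largest batch under 12 ms:", largest_batch_within_budget(12.0))
```

In practice the latency curve would come from benchmarking the actual model rather than from a formula. The selection step works the same way: set a latency target first, then take the highest throughput that meets it.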

Once you have high utilization across multiple instances, it’s worth considering a switch to a more powerful GPU type. For example, switching from A100 to H100 can save 20-45% on workloads with high utilization and enough traffic to require multiple A100 GPUs.

To extend our rideshare metaphor, switching to H100 is like getting Uber XL rides for your group of 12 — at 6 passengers to a car, you only need two cars, saving more money even if XL rides are slightly more expensive.
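
The arithmetic behind that decision is simple enough to sketch. The prices and per-instance throughput figures below are made-up placeholders, not actual A100 or H100 numbers. Substitute measured throughput and real pricing before drawing conclusions.

```python
import math


def instances_needed(traffic_rps, rps_per_instance):
    """Instances required to serve the traffic, rounded up to whole GPUs."""
    return math.ceil(traffic_rps / rps_per_instance)


def hourly_cost(traffic_rps, rps_per_instance, price_per_hour):
    return instances_needed(traffic_rps, rps_per_instance) * price_per_hour


# Hypothetical GPUs: the larger one serves 2x the requests at 1.5x the price.
SMALL_GPU = {"rps_per_instance": 10, "price_per_hour": 4.0}
LARGE_GPU = {"rps_per_instance": 20, "price_per_hour": 6.0}

for traffic_rps in (5, 100):
    small = hourly_cost(traffic_rps, **SMALL_GPU)
    large = hourly_cost(traffic_rps, **LARGE_GPU)
    print(f"{traffic_rps:>3} req/s: small GPUs ${small:.2f}/h, "
          f"large GPUs ${large:.2f}/h, savings {1 - large / small:.0%}")
```

With these placeholder numbers the larger GPU only pays off once traffic needs several of the smaller ones. At low traffic a single smaller GPU is cheaper.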

How to track GPU utilization

In your Baseten workspace, you can see the GPU utilization across compute and VRAM capacity (not bandwidth) for each deployed model. These charts align by timestamp with charts for traffic and autoscaling, so you can see exactly how real-world usage affects utilization.