Comparing GPUs across architectures and tiers

Datacenter GPUs remain relevant for years (there are still K80s running production workloads today), but the terms used to market them change as NVIDIA highlights the advantages of each new series of GPUs.

So what are reliable metrics for comparing GPUs across architectures and tiers to decide which one is the most cost-effective way to run your workload? We’ll consider core count, FLOPS, VRAM, and TDP.

Core count

The graphics cards you analyze might have several different types of cores:

CUDA cores: the most general-purpose cores for a wide variety of computing tasks.
Tensor cores: optimized for certain machine learning calculations.
Ray-tracing (RT) cores: more important for gaming than most ML, these cores specialize in simulating the behavior of light.

Raw core count is a good signal, but it isn’t the whole story. Different cards have different types of cores — some have more tensor cores, others have more CUDA cores — and cards on newer architectures may also have new generations of some types of core. A proper comparison requires a more standardized metric: FLOPS.

FLOPS

FLOPS stands for Floating Point Operations Per Second and is the critical measure of GPU performance.
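
As a rough illustration of how core count relates to FLOPS, theoretical peak throughput is often estimated as cores × clock speed × operations per core per cycle. The sketch below assumes each core completes one fused multiply-add (counted as two floating point operations) per cycle. The function name is hypothetical, and the A100 figures used (6,912 FP32 CUDA cores, 1.41 GHz boost clock) are assumptions taken from commonly published specs rather than from this article, so check them against the datasheet.

```python
def peak_tflops(cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Theoretical peak in teraFLOPS; assumes one FMA (2 ops) per core per cycle."""
    # cores * GHz * ops gives gigaFLOPS; divide by 1000 to get teraFLOPS.
    return cores * clock_ghz * ops_per_cycle / 1000.0


# Assumed A100 figures: 6,912 FP32 CUDA cores at a 1.41 GHz boost clock.
a100_fp32 = peak_tflops(cores=6912, clock_ghz=1.41)
print(f"Estimated FP32 peak: {a100_fp32:.1f} teraFLOPS")
```

This is an upper bound on paper; it says nothing about how close a given workload gets to it.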

There’s a complicating factor, though. GPU performance is measured at various precisions. Precision is the number of bits used to represent each number in a calculation, from 8-bit integers to 64-bit double precision floating point values.

[Figure: Number formats and corresponding use of bits]
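
To make the idea of precision concrete, here is a minimal sketch using Python's standard `struct` module. It prints how many bits a value occupies in a few common formats and shows how the stored value of 0.1 drifts as precision drops. The `FORMATS` table and helper names are illustrative choices, and GPU-specific formats such as TF32 or BF16 are left out because `struct` does not support them.

```python
import struct

# struct format codes for a few common number formats.
FORMATS = {
    "INT8": "b",  # 8-bit signed integer
    "FP16": "e",  # half precision float
    "FP32": "f",  # single precision float
    "FP64": "d",  # double precision float
}


def bits_per_value(name: str) -> int:
    """Return the number of bits one value occupies in the given format."""
    return struct.calcsize("=" + FORMATS[name]) * 8


def round_trip(value: float, name: str) -> float:
    """Store a value in the given float format and read it back."""
    code = "=" + FORMATS[name]
    return struct.unpack(code, struct.pack(code, value))[0]


for name in FORMATS:
    print(name, bits_per_value(name), "bits")

# 0.1 cannot be stored exactly; the error grows as precision drops.
for name in ("FP64", "FP32", "FP16"):
    print(name, repr(round_trip(0.1, name)))
```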

Computations on higher-precision number formats take more processing power. That’s where Tensor cores come into play. Tensor cores can do mixed-precision computation: they multiply values stored at a lower precision (such as FP16) and accumulate the results at a higher precision (such as FP32), which raises throughput while limiting the loss of accuracy. For a proper apples-to-apples comparison between GPUs, compare FLOPS at the same precision on the same core type.

For example, at the highest precision (FP64), NVIDIA’s top-shelf A100 GPU reaches 9.7 teraFLOPS on standard CUDA cores, but its Tensor cores roughly double that performance at the same precision with 19.5 teraFLOPS.
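
One way to keep a comparison apples-to-apples is to key every throughput figure by both precision and core type, and only compare entries that two cards share. The sketch below does this; the card names and every number in it are made-up placeholders, not real specifications.

```python
# Peak throughput in teraFLOPS keyed by (precision, core type).
# All numbers are made-up placeholders, not real specifications.
CARD_A = {("FP32", "cuda"): 30.0, ("FP16", "tensor"): 120.0}
CARD_B = {("FP64", "cuda"): 10.0, ("FP32", "cuda"): 20.0, ("FP16", "tensor"): 300.0}


def compare(card_x: dict, card_y: dict) -> dict:
    """Return the x/y throughput ratio for each (precision, core type) both cards list."""
    shared = sorted(card_x.keys() & card_y.keys())
    return {key: card_x[key] / card_y[key] for key in shared}


for (precision, core), ratio in compare(CARD_A, CARD_B).items():
    print(f"{precision} on {core} cores: card A delivers {ratio:.2f}x card B")
```

Entries that only one card reports, such as the FP64 figure here, are skipped rather than compared against a different precision.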

Lower precisions result in higher FLOPS counts. For example, here’s a comparison of compute power for the A10 and A100 GPUs at different precisions.

[Figure: Comparison of operations per second on A10 and A100 at different precisions]

VRAM

VRAM (Video Random Access Memory) is a graphics card’s onboard memory. VRAM is to GPUs as RAM is to CPUs. It stores data like model weights for rapid access during computations like model inference.

The most important factor for model serving is the amount of VRAM that a GPU has. For fast invocation, model weights must be stored in VRAM, so VRAM capacity limits model size.

Not all VRAM is equivalent. There are three other factors to consider:

Bus width, measured in bits, is the amount of data that can be transferred to and from VRAM at once. A wider bus helps load model weights faster.
Clock speed measures how fast the VRAM can process data, with higher clock speeds resulting in faster memory reads and writes.
GDDR and HBM are two different types of VRAM. HBM (High Bandwidth Memory) generally provides higher bandwidth with less power but costs more to manufacture than GDDR (Graphics Double Data Rate) memory. Recent 100-tier cards like the A100 and H100 use HBM.
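
The bus and clock speed figures above combine into a single number, memory bandwidth, which is usually what spec sheets report. As a minimal sketch, theoretical peak bandwidth can be estimated as the bus width in bytes multiplied by the effective per-pin data rate. The function name and the input values are illustrative assumptions, not the specification of any particular card.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: width of the memory bus in bits.
    data_rate_gbps: effective per-pin transfer rate in gigabits per second.
    """
    bytes_per_transfer = bus_width_bits / 8  # 8 bits per byte
    return bytes_per_transfer * data_rate_gbps


# Illustrative values only: a 384-bit bus at 12.5 Gbit/s per pin.
print(memory_bandwidth_gb_s(384, 12.5), "GB/s")
```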

To add a wrinkle, not all GPUs of the same tier have the same amount of VRAM. For example, the A100 comes in 40GB and 80GB versions. So before provisioning a GPU, make sure it has the right amount of VRAM to run your model.
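
A quick way to sanity-check VRAM requirements is to multiply the parameter count by the bytes each parameter occupies at a given precision. The sketch below does that for a hypothetical 30-billion-parameter model against the two A100 variants. The helper names and the 20% headroom for activations and other runtime memory are assumptions chosen for illustration, so treat the result as a first estimate rather than a guarantee.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weights_gb(num_params: float, precision: str) -> float:
    """Approximate size of model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(num_params: float, precision: str, vram_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving a fraction of VRAM free.

    The 20% default headroom is an arbitrary assumption to leave room for
    activations and other runtime memory; tune it for your workload.
    """
    return weights_gb(num_params, precision) <= vram_gb * (1 - headroom)


# Hypothetical 30-billion-parameter model on the two A100 variants.
for vram in (40, 80):
    print(f"{vram} GB card, FP16 weights fit: {fits(30e9, 'FP16', vram)}")
```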

TDP

TDP stands for Thermal Design Power, and it refers to the maximum power, in watts, that a GPU is designed to draw while running. Higher-tier cards generally have a larger TDP than lower-tier cards, but it’s not a perfect correlation.

Datacenters price GPU compute time based on a variety of factors, but a card’s TDP is one of them. Electricity costs money, and it also generates heat, which costs even more money to get rid of. So cards with a higher TDP have a higher operating cost, which will affect the price you pay as an end user for compute time.
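
To see how TDP feeds into operating cost, here is a back-of-the-envelope sketch that converts a card's TDP into an upper-bound electricity cost per hour. The function name, the TDP values, the $0.10 per kWh price, and the 1.5x overhead multiplier for cooling are all placeholder assumptions; real datacenter pricing depends on many other factors.

```python
def energy_cost_per_hour(tdp_watts: float, price_per_kwh: float, overhead: float = 1.0) -> float:
    """Upper-bound electricity cost of running one card at full TDP for an hour.

    overhead: multiplier for cooling and other facility power (1.0 = none).
    """
    kilowatts = tdp_watts / 1000.0
    return kilowatts * overhead * price_per_kwh


# Placeholder inputs: the TDP values, the $0.10/kWh price, and the 1.5x
# overhead are all assumptions chosen for illustration.
for tdp in (150, 400):
    cost = energy_cost_per_hour(tdp, price_per_kwh=0.10, overhead=1.5)
    print(f"{tdp} W card: about ${cost:.4f} per hour")
```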