# Introduction to quantizing ML models

### TL;DR

Quantization is the process of taking an ML model's weights and mapping them to a different number format that uses fewer bytes per parameter. This makes model inference substantially faster by making memory access more efficient and unlocking greater Tensor Core compute. However, quantization can meaningfully reduce model output quality for LLMs unless done carefully to preserve the model's perplexity.

ML models, from the simplest classification model to the most advanced LLM, are made up of parameters: numerical values assigned during training that determine how a model operates. Models can have billions of parameters; the "70B" in "Llama 2 70B" reveals that the model is made up of seventy billion parameters. These parameters are stored along with other configuration information in a file, called the model weights file, and are loaded onto a GPU to run inference.

At a super basic level, model weights are big matrices of numbers. These numbers all have a certain format, called a "precision." ML model weights generally use one of the following four precisions:

- `float32` (fp32): a 32-bit floating point number called "full precision."
- `float16` (fp16): a 16-bit floating point number called "half precision."
- `int8`: an 8-bit integer format.
- `int4`: a 4-bit integer format.

Quantizing is lowering the precision of your model weights by one or more steps along this chain, such as going from `float16` to `int8`. This in turn reduces the demand that running the model places on GPU compute and memory, improving inference performance.
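The practical effect of each precision is easy to estimate: multiply the parameter count by the bytes per parameter. A quick sketch (the helper and constants here are illustrative, not from any library):

```python
# Back-of-the-envelope weights-file sizes per precision.
# BYTES_PER_PARAM and weights_size_gb are illustrative names, not a real API.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_size_gb(num_params: float, precision: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 70-billion-parameter model like Llama 2 70B:
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weights_size_gb(70e9, precision):.0f} GB")
```

Each step down the chain halves the weights file, which is why quantizing can let the same model fit on fewer or smaller GPUs.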

## Why float16 is the default precision for LLM inference

Using a given precision for ML inference is a tradeoff between speed and accuracy. Using a higher precision means more accurate calculations, which manifests in higher output quality. But those more precise number formats are larger, which slows down inference considerably.

For inference on most LLMs and many other ML models, `float16` is the go-to precision because it presents the right tradeoff between expressiveness and speed:

- `float16` is a sufficiently expressive format for most model weights but a substantial savings versus `float32`.
- `float16` computation is well-supported on GPU Tensor Cores.

`float16` is a good starting point and best practice, but other precisions have a lot to offer as well.

Quantizing a model is a kind of compression, just like creating a JPEG image. When working with images, you need to find a balance between image size and quality. `float16` is like a high-quality JPEG: a larger but reasonable size with almost no loss from compression.

`float16` is a well-supported format on GPU Tensor Cores, offering strong compute performance. When looking at a GPU's `float16` capabilities, don't be distracted by a format called `bfloat16`, as that's meant for model training, not inference. For inference, a GPU's TFLOPS on FP16 Tensor Cores is your metric. While most LLM inference is memory-bound, it doesn't hurt that `float16` works well for compute, too.

That said, both `int8` and `int4` are very capable formats that, when used carefully, can deliver the same or nearly the same output quality during inference while saving substantially on memory bandwidth and increasing available compute resources. That's where quantization comes in to take advantage of these small-footprint number formats.

## How to quantize an ML model

Quantization is not a uniform process. There are different algorithms, approaches, and implementations across various libraries and frameworks. We heavily use TensorRT-LLM, a model serving framework by NVIDIA, which also has built-in quantization algorithms.

- A quantization algorithm finds a scaling factor to apply to model weights.
- One simple quantization algorithm is weights-only quantization, which we've had success with.
- Quantization requires similar GPU resources to inference; nothing close to what's required for training or fine tuning.

Quantization algorithms come in different forms, but they generally work by finding a scaling factor and using it to map a range of one type of values (say 16-bit floating point numbers) to a range of another type of values (say 8-bit integers). This scaling factor needs to consistently map these values without distortion to be effective.
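TensorRT-LLM's algorithms are more sophisticated, but the core scaling-factor idea can be sketched with a simple symmetric "absmax" quantizer. This is an illustrative example with made-up function names, not any framework's implementation:

```python
import numpy as np

def absmax_quantize(weights: np.ndarray):
    """Map float weights to int8 using a single symmetric scaling factor."""
    scale = float(np.abs(weights).max()) / 127.0  # largest weight maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.02, -0.51, 0.33, 0.98], dtype=np.float32)
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
# w_hat is close to w; the difference is the quantization error
```

The same scaling factor maps every value in the matrix, which is why it must fit the weights' range well to avoid distortion.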

One quantization algorithm that we've had success using is weights-only quantization. This algorithm only quantizes the model weights, leaving activations alone. This preserves some precision in key areas while quantizing most values. Weights-only quantization is implemented in TensorRT-LLM as well as many other frameworks and can be a great first approach when working with a new model.

Like with training or fine tuning a model, quantizing a model is a one-time process. Once you create quantized model weights, you can use them over and over again for inference. Fortunately, compared to training or fine tuning, quantizing only requires minimal GPU resources, similar to inference. When we quantized Mixtral 8x7B with TensorRT-LLM, running the weights-only algorithm only took a few minutes on an A100.

## How quantization improves inference performance

Quantizing a model makes running inference faster and cheaper in three ways:

- Reducing the file size of model weights files so you can use fewer or smaller GPUs
- Reducing the amount of memory access required by up to 4x per level of precision
- Increasing the amount of compute available by up to 2x per level of precision

In `float16`, every billion parameters of an LLM results in about two gigabytes of model weights. So the weights files for, say, Mistral 7B are about 14 gigabytes, while Mixtral 8x7B's 46.7 billion parameters come in at just under 94 gigabytes.

Quantizing reduces these file sizes linearly. Going from `float16` to `int8` for Mixtral cuts the file size to about 47 gigabytes, which fits on a single 80-gig A100 with plenty of headroom for inference. This means you can run the same model with half the hardware.

But quantizing offers more than just smaller file sizes. Most inference for LLMs is memory-bound, meaning that the GPU memory's bandwidth is just as important as its size. As quantized models use fewer bits per parameter, reads from memory are faster, speeding up inference. And it doesn't just double: because model weights are stored in matrices, there is four times less memory access required when working with `int8` vs `float16`, and another 4x for going down to `int4`.

Where inference is instead compute-bound, such as the prefill phase for language models, quantization is still helpful on most GPUs. GPUs have cores dedicated to computations at different precisions. For example, the A100 is capable of 312 TFLOPS of compute at `float16`, but it doubles to 624 trillion operations at `int8`. While there's always some overhead, quantization offers nearly linear speedups for compute-bound operations.

Even though memory access is up to four times faster and compute only gets up to twice as fast, most LLM inference is still going to be memory-bound after quantizing, even down to `int4`.

## Why quantization can make model output worse

There is a catch to all of these performance gains. Quantizing can sometimes make model output substantially worse. This depends on three factors:

- The model's perplexity, a score used to approximate a model's output quality.
- The size of the model, as models with more parameters generally quantize better.
- The dynamic range and distribution of model weights in key sections of the model.

Perplexity measures model output quality by giving the model a fixed set of inputs and outputs and calculating how "surprised" the model is by the output given the input. In other words, it measures the model's likelihood of generating a known good output in a fixed situation. For perplexity, lower scores are better.

When you quantize a model, one thing to look at is the model's perplexity before and after quantizing. If there's a noticeable increase in perplexity, that can signal that the quantization process degraded the model's reasoning ability.
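The before-and-after comparison comes down to a simple formula: perplexity is the exponential of the average negative log-likelihood the model assigns to the known good tokens. A minimal sketch, where the per-token probabilities are made up for illustration:

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) over the known good tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each correct next token
before = [0.60, 0.45, 0.70, 0.52]  # original float16 model
after = [0.55, 0.40, 0.66, 0.48]   # same model after quantizing

# If the quantized model is consistently less sure of good outputs,
# its perplexity rises, signaling degraded output quality.
print(perplexity(before), perplexity(after))
```

A model that always assigned probability 0.5 to the correct token would score a perplexity of exactly 2, which gives a feel for the scale: lower is better.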

This is more likely to happen to smaller models than larger ones. Intuitively, as smaller models have fewer parameters, each parameter is more important. There's less room for error. Fortunately, quantization is generally most useful on larger models with many billions of parameters, as smaller 7B models already fit on less expensive GPUs.

However, more than model size, the dynamic range of model weights within a given area of the model is the clearest indicator of how successful quantizing will be. Remember, quantizing a model is effectively mapping a set of larger, more expressive numbers to a set of smaller, less expressive ones. A large range and uniform distribution within model weights means that the mapping will be clear. As with quantizing in general, there are a range of algorithms and techniques to calculate this, but the distribution of model weights is a good predictor of how well a model will survive quantization.
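To see why the distribution matters, consider what a single outlier does to a simple absmax-style int8 mapping. This is an illustrative sketch with made-up weight values, not a production algorithm:

```python
import numpy as np

def int8_roundtrip_error(weights: np.ndarray) -> float:
    """Mean absolute error after an absmax int8 quantize/dequantize round trip."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127)
    return float(np.mean(np.abs(q * scale - weights)))

rng = np.random.default_rng(0)
typical = rng.uniform(-1, 1, size=1024).astype(np.float32)
with_outlier = typical.copy()
with_outlier[0] = 50.0  # one extreme weight stretches the whole range

# The outlier forces a much coarser grid on the other 1023 weights,
# so the average round-trip error jumps dramatically.
print(int8_roundtrip_error(typical), int8_roundtrip_error(with_outlier))
```

This is why quantization algorithms pay close attention to the range and distribution of weights in each part of the model rather than applying one naive scale everywhere.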

## When to quantize models for production inference

Quantizing models can be a great way to save money on inference by using fewer, smaller, less expensive GPUs to run the same model. If you can preserve an LLM's perplexity during the quantization process from `float16` to `int8` or even `int4`, you can get up to four times better memory usage and two times better compute performance without drastically affecting the model outputs.

Just as there exists a wider variety of quantization algorithms and frameworks than we covered here, so too exist other number formats. Recent GPUs like the L4 and H100 support `float8` as an alternative to `int8`. Depending on the algorithm used, quantizing to `float8` can be an improvement upon using `int8` for some models. Many cards also support `float64`, known as "double precision," but it's almost never relevant for model inference. By default, most LLM inference is done in `float16`, with quantization taking inference to one of the two lower precisions.

Quantization is a complex balance to get right: you want to target uniform values within the model weights without affecting the activations, KV caches, and other values essential for inference. And models like Mixtral 8x7B, with its mixture of experts architecture, introduce new complications like routers. But the payoff is that a successful quantization takes only about as much GPU time as a few minutes of inference, then permanently reduces the cost of operating your model.