FP8: Efficient model inference with 8-bit floating point numbers

The FP8 data format has a wider dynamic range than INT8, which makes it possible to quantize weights and activations for more LLMs without loss of output quality.

The benefits of globally distributed infrastructure for model serving

Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.

Why GPU utilization matters for model inference

Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.

Introduction to quantizing ML models

Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.

How to benchmark image generation models like Stable Diffusion XL

Benchmarking Stable Diffusion XL performance across latency, throughput, and cost depends on factors from hardware to model variant to inference config.

Understanding performance benchmarks for LLM inference

This guide helps you interpret LLM performance metrics to make direct comparisons on latency, throughput, and cost.

AI infrastructure: build vs. buy
