Double inference speed and throughput with NVIDIA H100 GPUs

Baseten is now offering model inference on H100 GPUs starting at $9.984/hour. Switching to H100s offers an 18 to 45 percent improvement in price-to-performance versus equivalent A100 workloads using TensorRT and TensorRT-LLM.

H100 stats

We’re using SXM H100s, which feature:

  • 989.5 teraFLOPS of FP16 tensor compute (vs. 312 teraFLOPS for the 80GB SXM A100)

  • 80 GB of VRAM (matching the 80GB SXM A100)

  • 3.35 TB/s memory bandwidth (vs. 2.039 TB/s for the 80GB SXM A100)

Most critically for LLM inference, which is generally memory bound during token generation, the H100 offers 64% higher memory bandwidth. The extra compute also helps for compute-bound tasks like prefill, which means much faster time to first token.
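To make the comparison concrete, here's a quick sanity check of those spec-sheet ratios (a minimal sketch in Python; the figures are the ones listed above):

```python
# Spec-sheet figures from the bullets above (SXM form factor).
h100 = {"fp16_tflops": 989.5, "vram_gb": 80, "bandwidth_tbs": 3.35}
a100 = {"fp16_tflops": 312.0, "vram_gb": 80, "bandwidth_tbs": 2.039}

bandwidth_gain = h100["bandwidth_tbs"] / a100["bandwidth_tbs"] - 1
compute_gain = h100["fp16_tflops"] / a100["fp16_tflops"] - 1

print(f"Memory bandwidth: +{bandwidth_gain:.0%}")   # +64%
print(f"FP16 tensor compute: +{compute_gain:.0%}")  # +217%
```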

Replacing A100 workloads with H100

An instance with a single H100 costs 62% more ($9.984/hr) than a single A100 instance ($6.15/hr). Looking only at the spec sheet, a 64% increase in memory bandwidth for a 62% higher price wouldn't suggest any improvement in performance per dollar.

However, thanks to TensorRT and TensorRT-LLM, you can save 18 to 45 percent on inference costs for workloads that use two or more A100-based instances by switching to H100s.

The H100 offers more than just increased memory bandwidth and higher core counts. TensorRT optimizes models to run on the H100’s new Hopper architecture, which unlocks additional performance:

  • Running Mistral 7B, we observed approximately 2x higher tokens per second and 2-3x lower prefill time across all batch sizes

  • Running Stable Diffusion XL, we observed approximately 2x lower total generation time across all step counts

With twice the performance at only a 62% higher price, switching to H100s offers 18% savings vs. A100s, with better latency. And if you increase concurrency on the H100 until latency reaches the A100 benchmarks, you can get as much as three times the throughput, a 45% savings on high-volume workloads.
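The savings figures fall directly out of the price and throughput ratios. Here's the back-of-envelope math (a minimal sketch; the prices and multipliers are the ones quoted above):

```python
# Hourly prices quoted above for single-GPU instances.
h100_price, a100_price = 9.984, 6.15

price_ratio = h100_price / a100_price  # ~1.62, i.e. 62% more expensive

# ~2x throughput at matched batch sizes (Mistral 7B benchmark above):
print(f"Savings at 2x throughput: {1 - price_ratio / 2:.1%}")  # 18.8%

# ~3x throughput with concurrency raised until latency matches the A100:
print(f"Savings at 3x throughput: {1 - price_ratio / 3:.1%}")  # 45.9%
```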

H100 pricing and instance types

An instance with a single H100 GPU costs $9.984/hour. Instances are available with 2, 4, and 8 H100 GPUs for running larger models that require extra VRAM; pricing scales linearly with GPU count.
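Since pricing scales linearly, per-instance rates are easy to derive (a minimal sketch using the single-GPU rate above):

```python
# Linear scaling from the single-H100 hourly rate quoted above.
single_h100_rate = 9.984  # $/hour

for gpu_count in (1, 2, 4, 8):
    print(f"{gpu_count}x H100: ${single_h100_rate * gpu_count:.3f}/hour")
# 1x: $9.984, 2x: $19.968, 4x: $39.936, 8x: $79.872
```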

Run your model on H100 GPUs

We’ve opened up access to our first batch of H100 GPUs and plan to aggressively scale our capacity. To enable H100 access in your Baseten account, get in touch and tell us about your use case. We’ll help you achieve substantial cost savings and performance improvements by switching to H100 GPUs for model inference.