Whether you’re looking for an optimized Mixtral 8x7B implementation or the exact right variant of SDXL with ControlNet, our new model library will help you compare models across version, variant, size, and implementation to find exactly what you need. And for that SDXL model, the newly available NVIDIA L4 GPU could be a great way to save money on inference, though only a rigorous performance benchmark will tell you for certain. All of this and more in this month’s newsletter!
In January, we published a new model library to better present the best open source machine learning models.
One focus in redesigning the model library was making it easier to navigate the taxonomy of open source models. A foundation model like Stable Diffusion comes in different versions, variants, and sizes, and in different implementations, such as different model server implementations and additional ControlNet image pipelines (which themselves come in different variants like Depth and Canny).
The new model library categorizes models by task (text to image, text generation, text to audio, etc.), by family (Stable Diffusion, Mistral, Audiogen, etc.), and by publisher (Stability AI, Mistral AI, Meta, etc.), then provides detailed information about each model’s version, variant, size, optimizations, license, and other essential properties.
Try the model library for yourself today, or learn more about what’s new in the changelog announcement.
The NVIDIA L4 GPU, based on NVIDIA’s latest Ada Lovelace architecture, is now available for model inference on Baseten starting at $0.8484/hour — about 70% of the base cost of an A10G-based instance.
The stat breakdown for L4 vs A10G shows some interesting tradeoffs:
The L4 has 121 TFLOPS of FP16 Tensor Compute (vs 70 TFLOPS on A10G)
The L4 has 24 GB of VRAM (matching 24 GB on A10G)
The L4 has 300 GB/s of memory bandwidth (vs 600 GB/s on A10G)
In summary, the L4 has roughly 1.7 times the FP16 compute of the A10G, the other 24-gigabyte GPU, but only half its memory bandwidth. Most inference for LLMs and other autoregressive transformer models tends to be memory-bound, so the A10G is still the better pick for tasks like LLM chat. The L4, however, offers cheaper inference on compute-bound workloads.
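To make the tradeoff concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the spec numbers and the "about 70%" price ratio quoted above. The `GPU` class and the helper names are hypothetical, introduced for illustration. It also assumes an idealized workload whose throughput scales perfectly with whichever resource is the bottleneck, which real models will not quite do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPU:
    name: str
    fp16_tflops: float     # FP16 tensor compute, in TFLOPS
    bandwidth_gb_s: float  # memory bandwidth, in GB/s

    @property
    def ops_per_byte(self) -> float:
        """FLOPs the GPU can perform per byte it can read from memory."""
        return (self.fp16_tflops * 1e12) / (self.bandwidth_gb_s * 1e9)


# Spec numbers quoted in this newsletter.
L4 = GPU("L4", fp16_tflops=121, bandwidth_gb_s=300)
A10G = GPU("A10G", fp16_tflops=70, bandwidth_gb_s=600)

# "About 70% of the base cost of an A10G-based instance" (approximate).
PRICE_RATIO_L4_TO_A10G = 0.70


def relative_cost_per_unit(price_ratio: float, speed_ratio: float) -> float:
    """Cost per unit of work on the L4 relative to the A10G.

    Below 1.0 the L4 is cheaper per unit of work; above 1.0 it is pricier.
    """
    return price_ratio / speed_ratio


# Idealized assumption: a memory-bound workload scales with bandwidth,
# and a compute-bound workload scales with FP16 TFLOPS.
memory_bound_speed = L4.bandwidth_gb_s / A10G.bandwidth_gb_s
compute_bound_speed = L4.fp16_tflops / A10G.fp16_tflops

for gpu in (L4, A10G):
    print(f"{gpu.name}: {gpu.ops_per_byte:.0f} FLOPs per byte")

print("memory-bound relative cost:",
      round(relative_cost_per_unit(PRICE_RATIO_L4_TO_A10G, memory_bound_speed), 2))
print("compute-bound relative cost:",
      round(relative_cost_per_unit(PRICE_RATIO_L4_TO_A10G, compute_bound_speed), 2))
```

Under these assumptions the L4 needs about 403 FLOPs of work per byte read to stay busy, versus about 117 on the A10G. In the idealized memory-bound case the L4 costs about 1.4 times as much per unit of work, and in the idealized compute-bound case about 0.4 times as much. Real workloads fall somewhere in between, which is why benchmarking your own model matters.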
Check out our announcement of L4 availability for more details on these new GPUs!
We’re always working to make model inference as fast and cheap as possible. But that requires a nuanced understanding of what “fast” means, along with a clear picture of the different factors that affect inference speed and cost. These details come from running performance benchmarks for model inference.
Benchmarking considerations vary somewhat from model to model. We wrote guides for setting up and interpreting benchmarks for two of the most widely used types of models:
One major lever for improving model performance is quantization. Quantization lets you run large models on smaller or fewer GPUs by reducing the size of model weights (e.g. from 16-bit floating point numbers to 8-bit integers). But quantization can be a tricky process and has the potential to severely worsen model output quality.
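As a toy illustration of the idea, the sketch below applies symmetric "absmax" quantization to a handful of made-up weights, mapping floats to 8-bit integers and back. This is a generic textbook scheme in pure Python, not the method used for any particular model, and the function names and weight values are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Made-up example weights; real tensors hold millions of values.
weights = [0.42, -1.27, 0.003, 0.88, -0.5]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer step bounds the error at half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("restored: ", [round(r, 4) for r in restored])
print("max error:", round(max_error, 4), "(step size:", round(scale, 4), ")")
```

Each weight now takes one byte instead of two, halving memory use relative to 16-bit floats. The cost shows up in the small weight 0.003, which rounds to zero and is lost entirely. Errors like this, accumulated across a whole model, are how quantization can degrade output quality.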
Our new introduction to quantizing ML models breaks down the advantages and risks of quantizing models for inference, based in part on our work quantizing Mixtral 8x7B for more efficient inference.
We’ll be back next month with more from the world of open source ML!
Thanks for reading,
— The team at Baseten