New in December 2023

TL;DR

In 2023, open-source ML models went from good to great. Models like Llama 2, Mistral, and Stable Diffusion XL set the pace for open source, while tools and techniques like LangChain and retrieval-augmented generation went mainstream. There is so much to look forward to in 2024, but before we get there, let's round out the year with a December recap featuring two powerful new models and a tutorial for deploying ComfyUI projects.

Faster Mixtral inference with TensorRT-LLM and int8 quantization

Mixtral 8x7B is a new open-source LLM that meets or exceeds Llama 2 70B in output quality, and it also wins on inference speed. Mixtral has two structural advantages from its mixture-of-experts architecture:

  • Mixtral has only 46.7B total parameters (not 8 × 7B = 56B, because attention layers are shared across experts)

  • During inference, only 12.9B parameters are active per token (Llama 2 uses all 70B)

We took Mixtral and made it even faster using TensorRT-LLM and int8 quantization.

Each layer of inference only uses two of eight experts.
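To make that routing concrete, here's a minimal sketch of top-2 expert selection in PyTorch. It's illustrative only: the router, experts, and shapes below are toy stand-ins we made up for this sketch, and Mixtral's real experts are SwiGLU feed-forward blocks running fused kernels, not single linear layers.

```python
import torch
import torch.nn.functional as F

def moe_layer(x, router, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs."""
    # x: (num_tokens, hidden_size)
    logits = router(x)                           # score all experts per token
    weights, chosen = torch.topk(logits, top_k)  # keep only the top 2 of 8
    weights = F.softmax(weights, dim=-1)         # renormalize over the top 2
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for k in range(top_k):
            mask = chosen[:, k] == e  # tokens whose slot k routed to expert e
            if mask.any():
                out[mask] += weights[mask, k, None] * expert(x[mask])
    return out

hidden, num_experts = 64, 8
router = torch.nn.Linear(hidden, num_experts)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
tokens = torch.randn(4, hidden)
print(moe_layer(tokens, router, experts).shape)  # torch.Size([4, 64])
```

Only the two selected experts run for each token, which is why just 12.9B of the 46.7B parameters do work on any given forward pass.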

Even with the mixture-of-experts architecture, all 46.7B parameters need to be loaded into VRAM during inference. At two bytes per parameter in float16, that's about 93 GB of weights, so the model requires two A100 GPUs (80 GB each) to run.
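The arithmetic behind that requirement, as a quick sketch:

```python
PARAMS = 46.7e9  # Mixtral's total parameter count

def weight_vram_gb(bytes_per_param):
    # Weights only; the KV cache and activations need extra headroom
    return PARAMS * bytes_per_param / 1e9

print(f"float16: {weight_vram_gb(2):.1f} GB")  # ~93.4 GB -> two 80 GB A100s
print(f"int8:    {weight_vram_gb(1):.1f} GB")  # ~46.7 GB -> one 80 GB A100
```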

But by quantizing to int8 (one byte per parameter), you can fit the same model on just one A100 with almost no loss in quality. To validate that there is no perceptible drop in model quality, we compare perplexity, a measure of how likely it is that the model would generate a given sentence, between the quantized and non-quantized models.
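For a rough sense of what that measurement looks like, here's a minimal sketch using Hugging Face transformers against the public Mixtral base checkpoint. This is the standard exp-of-mean-loss approach, not our actual TensorRT-LLM evaluation harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # public base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def perplexity(text: str) -> float:
    # Perplexity is exp(mean negative log-likelihood per token):
    # lower means the model finds the text more likely.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Run the same function against the quantized and non-quantized models on the same evaluation text; if the perplexities match closely, quantization hasn't hurt quality.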

Mixtral only gains 0.08% in perplexity when quantized to int8.

For the full results of our experimentation, read our Mixtral optimization report. If you’re looking for fast, reliable Mixtral inference, let us know and we can get you set up with the model and the required A100 GPUs.

Playground v2: a new image generation model for striking visuals

This month, Playground released Playground v2, a text-to-image model that shares an architecture with SDXL but is trained from scratch to create consistent, stylized images. Playground v2 could become the go-to image generation model for a number of use cases, from blog post header images to game asset generation.

Prompt: A scenic mountain landscape.

In our breakdown of Playground v2 versus SDXL, we compared the model outputs for a half-dozen prompts designed to demonstrate the capabilities and limitations of each model. Playground v2 demonstrated consistently high quality but lower stylistic range, while SDXL created a wide variety of images — some excellent, others less so. Give the comparison a read to determine which model you prefer for your application.
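If you want to try the comparison yourself, both models load through the same diffusers pipeline class because they share the SDXL architecture. A minimal sketch, assuming the public Hugging Face checkpoint IDs:

```python
import torch
from diffusers import DiffusionPipeline

# Swap the checkpoint ID for "stabilityai/stable-diffusion-xl-base-1.0"
# to generate the same prompt with SDXL instead.
pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(prompt="A scenic mountain landscape", num_inference_steps=30).images[0]
image.save("mountain.png")
```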

Deploy ComfyUI inference pipelines as API endpoints

ComfyUI is a modular GUI and backend for creating Stable Diffusion pipelines. These pipelines combine multiple models and processing steps, such as a base checkpoint plus a ControlNet, to create more sophisticated images than a single model call can produce.

Three Baseten logos generated with ComfyUI.

ComfyUI runs models on your local machine. While you can export and share your projects with other ComfyUI users, you may instead want to serve your image generation pipeline behind an API endpoint to share it more widely or integrate it into an application.

In our how-to guide for deploying ComfyUI models, we go step-by-step through packaging and deploying an image generation pipeline with Truss. As an example, the guide uses a ControlNet model to create images from a logo, but you can adapt the provided Truss to deploy any ComfyUI project.
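For a feel of the shape this takes, here's a minimal sketch of the model.py file at the heart of a Truss, wrapping an exported ComfyUI workflow in Truss's load/predict interface. The workflow filename and the run_comfyui_workflow helper are hypothetical placeholders; the guide fills in the real execution logic.

```python
# model/model.py in a Truss
import json

def run_comfyui_workflow(workflow: dict) -> bytes:
    # Hypothetical stand-in for executing the graph with ComfyUI's engine
    raise NotImplementedError

class Model:
    def __init__(self, **kwargs):
        self._data_dir = kwargs["data_dir"]  # Truss passes bundled files here
        self._workflow = None

    def load(self):
        # A ComfyUI project exported in API format is plain JSON
        with open(f"{self._data_dir}/comfy_workflow.json") as f:
            self._workflow = json.load(f)

    def predict(self, model_input: dict):
        workflow = json.loads(json.dumps(self._workflow))  # deep copy
        # Patch user input (e.g., a prompt or logo image) into the relevant
        # workflow nodes here; node IDs depend on your exported graph.
        image_bytes = run_comfyui_workflow(workflow)
        return {"image": image_bytes}
```

Once deployed, the pipeline is an HTTP endpoint: any application can POST a prompt and get an image back, with no ComfyUI installation required on the caller's side.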

We’ll be back next year with more models, guides, and open source projects than ever!

Thanks for reading!

— The team at Baseten