May 1, 2024

New in April 2024

TL;DR

April saw the release of best-in-class LLMs at every size, from 3.8 billion to 141 billion parameters. Phi 3, Llama 3, and Mixtral 8x22B soared to the top of eval rankings and best model lists industry-wide. Combined with other modalities, like streaming real-time speech synthesis, these models offer new possibilities for building with AI. And as deployments get more sophisticated, best practices like CI/CD pipelines help keep production stable and development smooth. Welcome to our April newsletter!

Best-in-class LLMs at four different sizes

We overhauled our ranking of the best open source large language models this month after chart-topping releases from Microsoft, Meta, and Mistral. What’s special about these recent models is the range of sizes covered, from 3.8 billion parameters for Phi 3 Mini up to 141 billion parameters for Mixtral 8x22B. This gives you maximum flexibility to trade off between cost and output quality depending on use case.

You can deploy cutting-edge LLMs directly from Baseten’s model library in one click:

3.8B (1 T4 GPU): Phi 3 Mini 4K/128K Instruct
8B (1 A100/H100 GPU): Llama 3 8B Instruct
70B (2 A100/H100 GPUs): Llama 3 70B Instruct
8x22B (4 A100/H100 GPUs): Mixtral 8x22B Instruct

For maximum performance with these new LLMs, optimizations like TensorRT-LLM implementations, FP8 quantization, and continuous batching improve inference efficiency, reduce latency, and increase throughput.
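As a rough illustration of why the GPU counts above scale with parameter count, and of what FP8 quantization changes, here is a back-of-the-envelope sketch. It counts model weights only (no KV cache, activations, or runtime overhead), and the per-GPU memory figures are assumptions: 16 GB for a T4 and the 80 GB variant of the A100/H100. The function names are illustrative and not part of any Baseten API.

```python
import math

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param

def min_gpus(params_billions: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / gpu_memory_gb)

# (label, parameters in billions, ASSUMED memory per GPU in GB)
MODELS = [
    ("Phi 3 Mini", 3.8, 16),      # T4, assumed 16 GB
    ("Llama 3 8B", 8, 80),        # A100/H100, assumed 80 GB variant
    ("Llama 3 70B", 70, 80),
    ("Mixtral 8x22B", 141, 80),
]

for name, params, gpu_gb in MODELS:
    fp16_gb = weight_memory_gb(params, 2)  # FP16: 2 bytes per parameter
    fp8_gb = weight_memory_gb(params, 1)   # FP8: 1 byte per parameter
    print(
        f"{name}: {fp16_gb:.1f} GB at FP16 -> {min_gpus(params, 2, gpu_gb)} GPU(s); "
        f"{fp8_gb:.1f} GB at FP8 -> {min_gpus(params, 1, gpu_gb)} GPU(s)"
    )
```

Under these assumptions, the FP16 estimates come out to 1, 1, 2, and 4 GPUs, matching the list above. Real deployments need extra headroom beyond the weights, so treat this as a lower bound rather than a sizing guide.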

Streaming text to speech with XTTS V2

Every LLM in Baseten’s model library has a streaming endpoint – you get model output as it is generated. Streaming is table stakes for LLMs, but isn’t as common for other modalities.

Any autoregressive model can stream output. Like LLMs, text to speech models such as XTTS V2 are autoregressive and work in tokens: the input text is tokenized and passed to the speech synthesis model, which iterates over those tokens to produce audio in small chunks. Each chunk can be streamed back from the API endpoint as soon as it is generated.

Real-time streaming text to speech unlocks use cases like AI phone calling and audio chatbots. With our new tutorial on streaming text to speech, you can build a streaming endpoint with a fast time to first chunk and high-quality vocals.
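To make the chunked flow concrete, here is a minimal sketch of a client that consumes a stream and records time to first chunk. It does not call XTTS V2 or any Baseten endpoint: `fake_tts_stream` is a hypothetical stand-in that emits placeholder bytes per group of tokens, and `consume_stream` works on any iterable of byte chunks, such as the chunks of a streamed HTTP response.

```python
import time
from typing import Dict, Iterable, Iterator, Optional

def fake_tts_stream(text: str, tokens_per_chunk: int = 4) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming speech model.

    Splits the input into whitespace tokens and yields one chunk per group
    of tokens. A real model would yield synthesized audio bytes instead.
    """
    tokens = text.split()
    for i in range(0, len(tokens), tokens_per_chunk):
        yield " ".join(tokens[i:i + tokens_per_chunk]).encode("utf-8")

def consume_stream(chunks: Iterable[bytes]) -> Dict[str, Optional[float]]:
    """Read chunks as they arrive and record basic streaming statistics."""
    start = time.perf_counter()
    time_to_first_chunk = None
    chunk_count = 0
    total_bytes = 0
    for chunk in chunks:
        if time_to_first_chunk is None:
            # Playback could begin here, before the rest has been generated.
            time_to_first_chunk = time.perf_counter() - start
        chunk_count += 1
        total_bytes += len(chunk)
    return {
        "chunks": chunk_count,
        "bytes": total_bytes,
        "time_to_first_chunk": time_to_first_chunk,
    }

stats = consume_stream(fake_tts_stream("one two three four five", tokens_per_chunk=2))
print(stats["chunks"], stats["bytes"])
```

The point of the design is that the consumer never waits for the full output: the first chunk is usable as soon as it arrives, which is what makes time to first chunk the key latency metric for streaming audio.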

CI/CD for AI models

Every web developer has access to continuous integration and continuous deployment (CI/CD) tools to make deployment to production a stable, repeatable process. We think every AI engineer should have the same.

Building a CI/CD pipeline for AI models has unique challenges: matching environments closely between development and production, ensuring seamless rollout and rollback, and of course figuring out how to validate model output for correctness.

Every model will have its own requirements for useful CI/CD. Using Baseten’s model management API, you can build customized CI/CD tooling for model deployment.

One new feature that helps: the --wait flag in the truss push command. Ordinarily, truss push returns as soon as the packaged model has been sent to Baseten for deployment. But with the --wait flag, the command doesn’t return a status code until the deployment process either succeeds or fails.

truss push --wait
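In a CI job, a blocking command like this can gate the rest of the pipeline. Below is a minimal sketch of that idea in Python. It assumes, as is conventional for command-line tools, that an exit status of 0 means the deployment succeeded and any other value means it failed; `run_deploy` is a hypothetical helper name, not part of Truss.

```python
import subprocess
from typing import Sequence

def run_deploy(command: Sequence[str]) -> bool:
    """Run a blocking deploy command and report whether it succeeded.

    ASSUMPTION: the command exits with status 0 on success and a nonzero
    status on failure.
    """
    result = subprocess.run(list(command))
    return result.returncode == 0

def pipeline_exit_code(deploy_succeeded: bool) -> int:
    """Map the deploy result to the exit code the CI job should return."""
    return 0 if deploy_succeeded else 1
```

A pipeline step could then call `run_deploy(["truss", "push", "--wait"])` and pass the result through `pipeline_exit_code` to `sys.exit`, so that later steps such as output validation or promotion only run after the deployment has finished successfully.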

We’ll be back next month with more from the world of open source AI!

Thanks for reading,

— The team at Baseten
