The best open source large language model

Large language models (LLMs) are the definitive category in generative AI. But with tens of thousands of options, it can be hard to feel confident about making the right tradeoffs between output quality, speed, and cost — especially when models specialize in different tasks.

Taking a holistic view across technical specifications, customer conversations, and our own testing, we’ve put together this list of models to guide you to the right starting point for building on open source text generation models, whether for chat, code completion, retrieval-augmented generation, or another LLM use case.

Best overall open source LLM: Mixtral 8x7B Instruct

Released by Mistral AI in December 2023, Mixtral 8x7B is a midsize LLM that uses a mixture-of-experts architecture to deliver incredible output quality with just 46.7 billion parameters. The model is licensed for commercial use and performs well across a wide range of tasks, including code generation. The instruct variant is fine tuned for chat usage and is aligned without overshooting the mark (meaning it’ll tell you how to kill a Docker container).
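
To make the mixture-of-experts idea concrete, here’s a minimal, illustrative routing layer in PyTorch: a small gating network picks the top two of eight expert MLPs for each token, so only a fraction of the model’s total parameters do work on any given token. The sizes and module names below are toy values for illustration, not Mixtral’s actual hyperparameters.

```python
# Toy top-2 mixture-of-experts feed-forward layer (illustrative, not
# Mixtral's real architecture or sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        # Route each token to its top-k experts; the other experts stay idle.
        weights, indices = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

This per-token routing is also why batching cuts into the efficiency gains noted below: different tokens in a batch activate different experts, so a large enough batch ends up touching all of them.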

What we love about Mixtral 8x7B:

  • High output quality in all of our testing, plus state-of-the-art performance on evaluation benchmarks.

  • 32k-token context window supports almost any use case, including retrieval-augmented generation.

  • Efficient inference on just one A100 for reasonable operating costs.

  • Permissive Apache 2.0 license for unrestricted commercial use.

What to watch out for with Mixtral 8x7B:

  • Batching model requests reduces the efficiency gains from the mixture-of-experts architecture.

  • Code generation capabilities, while decent overall, fall short of specialized models.

  • Light-touch alignment may not be suitable for all use cases.

Run Mixtral 8x7B optimized with TensorRT-LLM.

Another great LLM: Mistral 7B Instruct

The seven billion parameter “weight class” of large language models has some of the best performance per dollar on the market, as these highly capable models run well on a single A10G GPU. We recommend 7B models for experimentation and testing … but don’t be surprised if the output quality ends up being good enough for many production use cases.

Despite a smaller context window and worse reasoning ability than larger models, Mistral 7B is a capable do-it-all model with strong benchmark performance and efficient inference on A10G GPUs.

What we love about Mistral 7B:

  • Excellent output quality for a 7B parameter LLM.

  • 8k-token context window beats most 7B models for long conversations and retrieval-augmented generation (RAG).

  • High performance per dollar on just one A10G for inference.

  • Permissive Apache 2.0 license for unrestricted commercial use.

What to watch out for with Mistral 7B:

  • Not as powerful as larger LLMs like Mixtral 8x7B.

  • Can start to lose context in longer conversations — for example, it starts giving wrong answers after 5-6 commands when emulating a terminal in this LangChain example.

  • Light-touch alignment may not be suitable for all use cases.

Deploy Mistral 7B on an A10G.

Best aligned chat LLM: Zephyr 7B

Alignment is a tricky balance. On the one hand, you want a model that isn’t going to generate hurtful or incorrect content. On the other hand, you want the model to stay useful and not refuse to answer genuine queries. Hugging Face’s H4 research team is working on this problem with their Zephyr LLMs.

What we love about Zephyr 7B:

  • Helpful assistant behavior boosts output quality both on evaluation benchmarks and in ordinary use.

  • Supports ChatCompletions-style roles out of the box (see the sketch at the end of this section).

  • Based on Mistral, Zephyr 7B inherits its permissive commercial licensing.

What to watch out for with Zephyr 7B:

  • Zephyr 7B has not been through more advanced human-in-the-loop alignment for safety and can generate problematic output when directly prompted.

  • Zephyr 7B struggles with math, code generation, and similar technical tasks.

  • Still under active development, so you may need to upgrade to a new version soon for best performance.

Use Zephyr 7B with ChatCompletions-compatible API endpoints.
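
For illustration, here’s a minimal sketch of those ChatCompletions-style roles with the zephyr-7b-beta checkpoint via Hugging Face transformers. The messages and generation settings are placeholders; in practice you’d also configure device placement and dtype.

```python
# Minimal sketch: Zephyr 7B with system/user roles via transformers.
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a friendly, concise assistant."},
    {"role": "user", "content": "Explain retrieval-augmented generation in one sentence."},
]
# apply_chat_template inserts the special tokens Zephyr expects.
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(pipe(prompt, max_new_tokens=128)[0]["generated_text"])
```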

Best ML model for code generation: Code Llama

Code Llama is a project by Meta to fine tune their Llama 2 family of models to specialize in code generation tasks. The Code Llama family has twelve models, as the model is available in four sizes (7B, 13B, 34B, and the new 70B) across three variants (Base, Instruct, and Python).

The models were trained on over 500 billion tokens of code, with additional specialized training variant-by-variant (e.g. the Python variant was trained on another 100 billion tokens of Python code). The largest models (34B and 70B parameters) outperform GPT-3.5 on evaluation benchmarks targeted at code generation.

What we love about Code Llama:

  • 70B Instruct variant scores 67.8 on HumanEval (pass@1) vs 67.0 for GPT-4.

  • Four sizes (7B, 13B, 34B, and 70B) and three variants (Base, Instruct, and Python) for maximum flexibility.

  • 7B and 13B sizes are lower latency than 34B and have built-in code completion capabilities (see the sketch at the end of this section).

  • Large context window (up to 100K tokens) is essential for working with code as context (code is much more token-dense than natural language).

What to watch out for:

  • The most powerful 34B and 70B models — the only ones to surpass GPT-3.5 on benchmarks — do not have fill-in-the-middle code completion capabilities out of the box.

  • Only the Instruct variant is capable of responding to natural language prompts; the Base and Python variants are code completion models.

  • Llama 2 models have a special license that also applies to Code Llama models.

Try Code Llama 7B Instruct for chat-based coding.
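
To illustrate the built-in code completion mentioned above, here’s a minimal sketch of fill-in-the-middle generation with the 7B Base model via Hugging Face transformers. The function being completed is a toy example; the <FILL_ME> marker tells the tokenizer where the prefix ends and the suffix begins.

```python
# Minimal sketch: fill-in-the-middle completion with Code Llama 7B (Base).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The model generates the code that belongs at <FILL_ME>.
prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated middle section.
filling = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(prompt.replace("<FILL_ME>", filling))
```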

Best model for fine tuning: Llama 2

The Llama 2 family of LLMs offers the most flexibility for fine tuning projects across size (7B, 13B, and 70B) and focus (base and code variants). Given Llama 2 models’ strong base performance, any model from the family is a powerful foundation to build on. That’s why Llama 2 is the foundation model of choice for projects like WizardLM.
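
As one common approach, here’s a minimal sketch of parameter-efficient fine tuning on the Llama 2 7B base model with LoRA via the peft library. The hyperparameters are illustrative, and the meta-llama checkpoints require accepting Meta’s license on Hugging Face before downloading.

```python
# Minimal sketch: LoRA fine tuning setup for Llama 2 7B with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # trains a small fraction of the 7B weights
# From here, train with the Trainer and dataset of your choice.
```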

What we love about Llama 2 for fine tuning:

  • Base models in three sizes (7B, 13B, and 70B) let you make tradeoffs between cost and performance.

  • Code Llama family provides a starting point for custom code generation models.

  • Llama models have a history as popular choices for fine tuning work, so there’s plenty of research and tooling to build on.

What to watch out for:

  • Heavy-handed alignment in the instruct/chat variants of the models means you may need to start over from the base variant.

  • Llama 2 models have a special license that also applies to fine tuned variants.

Experiment with Llama 2 7B Chat on autoscaling infrastructure.

Can open source models replace OpenAI and ChatGPT?

Several LLMs recommended in this guide, including Mixtral 8x7B, Mistral 7B, and Zephyr 7B, are available on Baseten with ChatCompletions-compatible endpoints, meaning you can test open-source LLMs in your existing code base just by changing a couple of lines of configuration. If there’s another model you’d like to use with this endpoint, just let us know.
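
Here’s a minimal sketch of what that configuration change looks like with the OpenAI Python client. The base_url, API key, and model name below are placeholders, not real endpoint values.

```python
# Minimal sketch: reusing OpenAI-client code with an open source LLM behind
# a ChatCompletions-compatible endpoint. URL, key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-model-host.example.com/v1",  # your endpoint here
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="mixtral-8x7b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "How do I kill a Docker container?"}],
)
print(response.choices[0].message.content)
```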

How much should I trust model evaluation benchmarks?

Model evaluation benchmarks measure an LLM’s performance on a fixed set of tasks. These benchmarks are designed to assess the accuracy and quality of the model’s output. While there is no universal standard benchmark, there are a number of popular options including ARC, HellaSwag, and MMLU.

There are legitimate concerns about the usefulness of evaluation benchmarks: they can be too narrow to fully capture a model’s performance, and more recently there have been worries about evaluation sets leaking into models’ training data. Both problems have mitigations. It’s standard practice to look at a model’s average performance across several benchmarks to account for the limitations of any one test, and researchers check their training data for contamination before releasing models.
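
As a toy illustration of that averaging practice, the scores below are made up, not real results for any model:

```python
# Made-up benchmark scores for one hypothetical model.
scores = {"ARC": 0.66, "HellaSwag": 0.85, "MMLU": 0.70}
average = sum(scores.values()) / len(scores)
print(f"Average benchmark score: {average:.3f}")  # 0.737
```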

Benchmark performance is a solid signal when picking an LLM, but it isn’t the whole story. There’s no need to switch models every time a new variant comes out with a slight uptick in benchmark score; the most important thing is to evaluate model output for your exact use case.

What about ML models larger than 70 billion parameters?

Right now, the state of the art in open source ML models caps out at about 70 billion parameters. There are great open-source projects working on larger models — BLOOM and Falcon 180B are both powerful LLMs — but smaller models like our overall pick, Mixtral 8x7B, match or exceed their evaluation benchmark performance.

While we run Mixtral 8x7B on one 80-gig A100 GPU, the team behind Falcon 180B recommends a minimum of 400 gigabytes of VRAM for inference. That’s five A100 GPUs, meaning Falcon 180B costs at least five times as much to run while not consistently delivering better results.

There’s nothing magic about 70 billion parameters. As larger open source models are trained and released, we may adjust our recommendations. For now, the strength of moderately sized models means we don’t recommend open source models larger than 70 billion parameters for almost any use case: operating costs rise steeply for little to no gain in output quality.

What about models smaller than 7 billion parameters?

Small models can run fast on less expensive, more available GPUs, making them a cost-effective choice for simple use cases.

Until recently, LLMs smaller than 7 billion parameters haven’t been ready for production use, but new models are changing this. Phi-2 by Microsoft is now MIT licensed and offers strong performance with only 2.7 billion parameters, while StableLM by Stability AI comes in at 2.8 billion parameters. FLAN-T5 has sizes as small as 80M parameters, and the new Qwen family of LLMs also has a 1.8B size.

Small models are also great for highly specific use cases, like generating SQL code. NSQL 350M has one-twentieth as many parameters as Mistral 7B but does a good job at its one task: generating SQL code.
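
For illustration, here’s a minimal sketch of running NSQL 350M with Hugging Face transformers. The schema and question are made up, and the prompt shape (schema, a commented question, then a leading SELECT) follows the general pattern in the model’s documentation; check the model card for the exact template.

```python
# Minimal sketch: text-to-SQL with NSQL 350M. Schema and question are made up.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NumbersStation/nsql-350M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = """CREATE TABLE orders (id INT, customer TEXT, total REAL);

-- Using valid SQLite, answer the following question for the table provided above.

-- What is the total revenue per customer?

SELECT"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```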

The best open source LLM

There’s no one best open source LLM, only the LLM that’s best for you. This selection depends on capabilities, features, price point, and license. New models are released every day, and it can feel overwhelming to keep up. But finding the right model doesn’t have to be hard: start with the recommendations above, then validate the output against your own use case.

Deploy the best open source LLM for your use case in just a couple of clicks: