The best open source large language model

Large language models (LLMs) are the definitive category in generative AI. But with tens of thousands of options, it can be hard to feel confident about making the right tradeoffs between output quality, speed, and cost — especially when models specialize in different tasks.

Taking a holistic view across technical specifications, customer conversations, and our own testing, we’ve put together this list of models to guide you in finding the right starting place for building on top of open source text generation models for chat, code completion, retrieval-augmented generation, and more LLM use cases.

Best overall open source LLM: Llama 3 70B Instruct

Meta's latest LLM family, Llama 3, offers 8B and 70B parameter instruct-tuned models with excellent benchmark performance. The larger model, Llama 3 70B Instruct, compares favorably to GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. The models were trained on over 15 trillion tokens, with an emphasis on code and a knowledge cutoff date of December 2023 for Llama 3 70B (March 2024 for Llama 3 8B).

What we love about Llama 3 70B Instruct:

  • Along with Meta's systematic investments in safety, Llama 3 models have been instruct-tuned to reduce false refusal rates.

  • Strong code generation and mathematical reasoning capabilities in a general model.

  • New, more efficient tokenizer yields up to 15% fewer tokens, meaning you generate fewer tokens per request.

What to watch out for with Llama 3 70B Instruct:

  • Relatively small 8k-token context window is 1/4 to 1/8 the size of similarly powerful models.

  • Llama 3 70B Instruct is an English-first model with nearly 95% of tokens in the dataset in English.

  • Llama 3 models have a custom commercial license that also applies to any fine-tuned derivatives.

Get started with Llama 3 70B Instruct or try the smaller but still excellent Llama 3 8B Instruct.

The best big LLM: Mixtral 8x22B Instruct

In April 2024, Mistral AI announced Mixtral 8x22B Instruct, a new open source LLM that, like December's Mixtral 8x7B, uses a mixture-of-experts architecture to enable a large model (141B parameters) to only use 39B active parameters during inference. This makes the model cheaper to run while still yielding high-quality output.

What we love about Mixtral 8x7B:

  • Natively multilingual model with fluency in English, French, Italian, German, and Spanish.

  • Massive 64k-token context window for retrieval-augmented generation and tool use.

  • Built-in function calling with specialized tool and function tokens.

  • Permissive Apache 2.0 license for unrestricted commercial use.

What to watch out for with Mixtral 8x7B:

  • Batching model requests reduces efficiency gains from Mixture of Experts architecture.

  • Inference requires multiple 80-gigabyte GPUs, for some use cases a less-powerful model will suffice at a lower cost.

  • Light-touch alignment may not be suitable for all use cases.

Deploy Mixtral 8x22B in one click.

Another great LLM: Mixtral 8x7B Instruct

Released by Mistral AI in December 2023, Mixtral 8x7B is a midsize LLM that uses a mixture-of-experts architecture to deliver incredible output quality with just 46.7 billion parameters. The model is licensed for commercial use and performs well across a wide range of tasks, including code generation. The instruct variant is fine tuned for chat usage and is aligned without overshooting the mark (meaning it’ll tell you how to kill a Docker container).

What we love about Mixtral 8x7B:

  • High output quality in all of our testing (plus with state-of-the-art performance on evaluation benchmarks).

  • 32k-token context window supports almost any use case plus retrieval-augmented generation.

  • Efficient inference on just one A100 for reasonable operating costs.

  • Permissive Apache 2.0 license for unrestricted commercial use.

What to watch out for with Mixtral 8x7B:

  • Batching model requests reduces efficiency gains from Mixture of Experts architecture.

  • Code generation capabilities, while decent overall, fall short of specialized models.

  • Light-touch alignment may not be suitable for all use cases.

Run Mixtral 8x7B optimized with TensorRT-LLM.

Best aligned chat LLM: Zephyr model family

Alignment is a tricky balance. On the one hand, you want a model that isn’t going to generate hurtful or incorrect content. On the other hand, you want the model to stay useful and not refuse to answer genuine queries. Hugging Face’s H4 research team is working on this problem with their Zephyr LLMs.

What we love about Zephyr models:

  • Helpful assistant behavior boosts output quality both on evaluation benchmarks and ordinary use.

  • Supports ChatCompletions-style roles out of the box.

  • Based on Mistral, Zephyr 7B and 8x22B inherit permissive commercial licensing.

What to watch out for with Zephyr models:

  • Zephyr 7B has not been through more advanced in-the-loop alignment for safety and can generate problematic output when directly prompted.

  • Zephyr 7B doesn’t do well for math, code generation, and similar topics.

  • Still under active development, so you may need to upgrade to a new version soon for best performance.

Use Zephyr 7B with ChatCompletions-compatible API endpoints.

Best ML model for code generation: Code Llama

Code Llama is a project by Meta to fine tune their Llama 2 family of models to specialize in code generation tasks. The Code Llama family has nine models, as the model is available in four sizes (7B, 13B, 34B, and the new 70B) across three variants (Base, Instruct, and Python).

The models were trained on over 500 billion tokens of code, with additional specialized training variant-by-variant (e.g. the Python variant was trained on another 100 billion tokens of Python code). The largest models (34B and 70B parameters) outperform GPT on evaluation benchmarks targeted at code generation.

What we love about Code Llama:

  • 70B Instruct variant scores 67.8 on HumanEval (pass@1) vs 67.0 for GPT-4.

  • Four sizes (7B, 13B, 34B, and 70B) and three variants (Base, Instruct, and Python) for maximum flexibility.

  • 7B and 13B sizes are lower latency than 34B and have built-in code completion capabilities.

  • Large context window (up to 100K tokens) is essential for working with code as context (code is much more token-dense than natural language).

What to watch out for:

  • The most powerful 34B and 70B models — the only ones to surpass GPT on benchmarks — do not have code completion capabilities out of the box.

  • Only the Instruct variant is capable of responding to natural language prompts, the other two variants are code completion models.

  • Llama models have a special license that also applies to Code Llama models.

Try Code Llama 7B Instruct for chat-based coding.

Best model for fine tuning: Llama 3

The Llama 3 family of LLMs offers the most flexibility for fine tuning projects across size (8B and 70B) and focus (base and code variants). Given Llama 3 models’ strong base performance, any model from the family is a powerful foundation to build on. That’s why Llama 3 is the foundation model of choice for projects like WizardLM.

What we love about Llama 3 for fine tuning:

  • Base models in 2 different sizes (8B, 70B) lets you make tradeoffs between cost and performance.

  • Code Llama family provides a starting point for custom code generation models.

  • Llama models have a history as popular choices for fine tuning work, so there’s plenty of research and tooling to build on.

What to watch out for:

  • Heavy-handed alignment in the instruct/chat variants of the models means you may need to start from scratch from the base variant.

  • Llama 3 models have a special license that also applies to fine tuned variants.

Experiment with Llama 3 8B Instruct on autoscaling infrastructure.

Can open source models replace OpenAI and ChatGPT?

Newer open source LLMs like Llama 3 70B Instruct compare favorably to closed-source options like GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. And with fine-tuning, open source models can match or beat the best closed-source models for specific tasks.

Several LLMs recommended in this guide, including Llama 3 70B, Mixtral 8x7B, and Zephyr 7B meaning you can test open-source LLMs in your existing code base just by changing a couple lines of configuration. If there’s another model you’d like to use with this endpoint, just let us know.

How much should I trust model evaluation benchmarks?

Model evaluation benchmarks measure an LLM’s performance on a fixed set of tasks. These benchmarks are designed to assess the accuracy and quality of the model’s output. While there is no universal standard benchmark, there are a number of popular options including ARC, HellaSwag, and MMLU.

There are some worries about the usefulness of evaluation benchmarks. Generally, evaluation benchmarks could be too narrow to fully capture a model’s performance, and more recently there have been concerns about evaluation sets leaking into models’ training data. These problems have solutions. It’s standard practice to look at a model’s average performance across several benchmarks to account for the limitations of any one benchmark, and researchers check for contamination of their training data before releasing models.

Benchmarks performance is a solid signal when picking an LLM, but isn’t the whole story. There’s no need to switch models every time a new variant comes out with a slight uptick in benchmark score, and the most important thing to do is evaluate model output for your exact use case.

What about ML models larger than 70 billion active parameters?

Right now, the state of the art in open source ML models caps out at about 70 billion active parameters (Mixtral 8x22B has only 39B active parameters). However, Meta is working on Llama 3 400B, which promises to be a massively powerful model.

Older large models, such as BLOOM and Falcon 180B, perform worse on evaluation benchmarks than newer midsize models like Mixtral 8x7B while requiring substantially more GPU resources for inference. Until a model like Llama 3 400B is released that justifies the inference cost, stick with midsize models that offer excellent performance at reasonable cost.

What about models smaller than 7 billion parameters?

Small models can run fast on less expensive, more available GPUs, making them a cost-effective choice for simple use cases.

Until recently, LLMs smaller than 7 billion parameters haven’t been ready for production use, but new models are changing this. Phi-2 by Microsoft is now MIT licensed and offers strong performance with only 2.7 billion parameters, while StableLM by Stability AI comes in at 2.8 billion parameters. FLAN-T5 has sizes as small as 80M parameters, and the new Qwen family of LLMs also has a 1.8B size.

Small models are also great for highly specific use cases, like generating SQL code. NSQL 350M has one-twentieth as many parameters as Mistral 7B but does a good job at its one task: generating SQL code.

The best open source LLM

There’s no one best open source LLM, only the LLM that’s best for you. This selection depends on capabilities, features, price point, and license. New models are released every day, and it can feel overwhelming to keep up. But finding the right model for your use case is possible with a bit of guidance and experimentation.

Deploy the best open source LLM for your use case in just a couple of clicks: