The best open source large language model
Large language models (LLMs) are the definitive category in generative AI. But with tens of thousands of options, it can be hard to feel confident about making the right tradeoffs between output quality, speed, and cost — especially when models specialize in different tasks.
Taking a holistic view across technical specifications, customer conversations, and our own testing, we’ve put together this list of models to guide you in finding the right starting place for building on top of open source text generation models for chat, code completion, retrieval-augmented generation, and more LLM use cases.
Best overall open source LLM: Llama 3.1 70B Instruct
Meta's latest LLM family, Llama 3.1, offers 8B, 70B, and 405B parameter instruct-tuned models with excellent benchmark performance. The larger model, Llama 3.1 70B Instruct, compares favorably to GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. The models were trained on over 15 trillion tokens, with an emphasis on code and a knowledge cutoff date of December 2023 for Llama 3 70B (March 2024 for Llama 3 8B).
What we love about Llama 3.1 70B Instruct:
128k-token context window with excellent retrieval benchmarks for building RAG-type applications.
Along with Meta's systematic investments in safety, Llama 3 models have been instruct-tuned to reduce false refusal rates.
Strong code generation and mathematical reasoning capabilities in a general model.
New, more efficient tokenizer yields up to 15% fewer tokens, meaning you generate fewer tokens per request.
What to watch out for with Llama 3 70B Instruct:
Llama 3.1 70B only supports eight languages, while many models support 2-3x as many.
Llama 3.1 models have a custom commercial license that also applies to any fine-tuned derivatives.
Get started with Llama 3.1 70B Instruct or try the smaller but still excellent Llama 3.1 8B Instruct.
The best big LLM: Llama 3.1 405B Instruct
Llama 3.1 405B is an open-source model that truly rivals heavyweights like GPT-4o. While other large models like Mistral Large 2 and Cohere Command-R plus are also extremely powerful, Llama 405B is licensed for commercial use with restrictions that few startups or enterprises would run up against.
What we love about Llama 3.1 405B:
Benchmarks favorably against the best closed-source models and backs up those scores with excellent observed real-world performance.
Massive 128k-token context window for retrieval-augmented generation and tool use.
Built-in function calling with specialized tool and function tokens.
What to watch out for with Llama 3.1 405B:
The model is so large that it generally must be run at FP8 on H100 GPUs.
Inference is expensive even with optimizations, for many use cases a less-powerful model like Llama 3.1 70B will suffice at a lower cost.
Llama 3.1 models have a custom commercial license that also applies to any fine-tuned derivatives.
Contact us for access to Llama 3.1 405B.
Best small LLM under 7 billion parameters: Phi 3 Mini
On the opposite end of the spectrum, Phi 3 Mini is an open source instruct-tuned LLM by Microsoft that achieves state of the art performance for models of its size at just 3.8 billion parameters. Phi 3 Mini runs fast on cheap hardware, making it a strong option for low-cost inference.
What we love about Phi 3 Mini:
Excellent output quality rivals 7B LLMs from just a few months ago.
128k-token context window variant allows for unprecedented use cases for models of this size class.
Permissive MIT license for unrestricted commercial use.
What to watch out for with Phi 3 Mini models:
While the LLM is outstanding for its class, output quality falls behind larger models, especially for factual recall.
The 4k-token context window variant consistently scores slightly higher on evals; only use the 128k-token variant when the increased context window is strictly necessary.
Phi 3 is an English-only model.
Deploy Phi 3 Mini 4k (or the 128k variant) on a T4 GPU.
Another great LLM: Mixtral 8x7B Instruct
Released by Mistral AI in December 2023, Mixtral 8x7B is a midsize LLM that uses a mixture-of-experts architecture to deliver incredible output quality with just 46.7 billion parameters. The model is licensed for commercial use and performs well across a wide range of tasks, including code generation. The instruct variant is fine tuned for chat usage and is aligned without overshooting the mark (meaning it’ll tell you how to kill a Docker container).
What we love about Mixtral 8x7B:
High output quality in all of our testing (plus with state-of-the-art performance on evaluation benchmarks).
32k-token context window supports almost any use case plus retrieval-augmented generation.
Efficient inference on just one A100 for reasonable operating costs.
Permissive Apache 2.0 license for unrestricted commercial use.
What to watch out for with Mixtral 8x7B:
Batching model requests reduces efficiency gains from Mixture of Experts architecture.
Code generation capabilities, while decent overall, fall short of specialized models.
Light-touch alignment may not be suitable for all use cases.
Run Mixtral 8x7B optimized with TensorRT-LLM.
Best aligned chat LLM: Zephyr model family
Alignment is a tricky balance. On the one hand, you want a model that isn’t going to generate hurtful or incorrect content. On the other hand, you want the model to stay useful and not refuse to answer genuine queries. Hugging Face’s H4 research team is working on this problem with their Zephyr LLMs.
What we love about Zephyr models:
Helpful assistant behavior boosts output quality both on evaluation benchmarks and ordinary use.
Supports ChatCompletions-style roles out of the box.
Based on Mistral, Zephyr 7B and 8x22B inherit permissive commercial licensing.
What to watch out for with Zephyr models:
Zephyr 7B has not been through more advanced in-the-loop alignment for safety and can generate problematic output when directly prompted.
Zephyr 7B doesn’t do well for math, code generation, and similar topics.
Still under active development, so you may need to upgrade to a new version soon for best performance.
Use Zephyr 7B with ChatCompletions-compatible API endpoints.
Best ML model for code generation: Code Llama
Code Llama is a project by Meta to fine tune their Llama 2 family of models to specialize in code generation tasks. The Code Llama family has nine models, as the model is available in four sizes (7B, 13B, 34B, and the new 70B) across three variants (Base, Instruct, and Python).
The models were trained on over 500 billion tokens of code, with additional specialized training variant-by-variant (e.g. the Python variant was trained on another 100 billion tokens of Python code). The largest models (34B and 70B parameters) outperform GPT on evaluation benchmarks targeted at code generation.
What we love about Code Llama:
70B Instruct variant scores 67.8 on HumanEval (pass@1) vs 67.0 for GPT-4.
Four sizes (7B, 13B, 34B, and 70B) and three variants (Base, Instruct, and Python) for maximum flexibility.
7B and 13B sizes are lower latency than 34B and have built-in code completion capabilities.
Large context window (up to 100K tokens) is essential for working with code as context (code is much more token-dense than natural language).
What to watch out for:
The most powerful 34B and 70B models — the only ones to surpass GPT on benchmarks — do not have code completion capabilities out of the box.
Only the Instruct variant is capable of responding to natural language prompts, the other two variants are code completion models.
Llama models have a special license that also applies to Code Llama models.
Try Code Llama 7B Instruct for chat-based coding.
Best model for fine tuning: Llama 3.1
The Llama 3.1 family of LLMs offers the most flexibility for fine tuning projects across size (8B, 70B, 405B) and focus (base and code variants). Given Llama 3 models’ strong base performance, any model from the family is a powerful foundation to build on. That’s why Llama 3.1 is the foundation model of choice for projects like WizardLM.
What we love about Llama 3.1 for fine tuning:
Base models in 2 different sizes (8B, 70B, 405B) lets you make tradeoffs between cost and performance.
New Llama 3.1 license explicitly allows for derivatives and teacher models.
Llama models have a history as popular choices for fine tuning work, so there’s plenty of research and tooling to build on.
What to watch out for:
Heavy-handed alignment in the instruct/chat variants of the models means you may need to start from scratch from the base variant.
Llama 3.1 models have a special license that also applies to fine tuned variants.
Experiment with Llama 3.1 8B Instruct on autoscaling infrastructure.
Can open source models replace OpenAI and ChatGPT?
Yes. Llama 3.1 405B compares favorably to GPT-4o on most benchmarks.
Newer open source LLMs like Llama 3 70B Instruct compare favorably to closed-source options like GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. And with fine-tuning, open source models can match or beat the best closed-source models for specific tasks at much lower costs.
How much should I trust model evaluation benchmarks?
Model evaluation benchmarks measure an LLM’s performance on a fixed set of tasks. These benchmarks are designed to assess the accuracy and quality of the model’s output. While there is no universal standard benchmark, there are a number of popular options including ARC, HellaSwag, and MMLU.
There are some worries about the usefulness of evaluation benchmarks. Generally, evaluation benchmarks could be too narrow to fully capture a model’s performance, and more recently there have been concerns about evaluation sets leaking into models’ training data. These problems have solutions. It’s standard practice to look at a model’s average performance across several benchmarks to account for the limitations of any one benchmark, and researchers check for contamination of their training data before releasing models.
Benchmarks performance is a solid signal when picking an LLM, but isn’t the whole story. There’s no need to switch models every time a new variant comes out with a slight uptick in benchmark score, and the most important thing to do is evaluate model output for your exact use case.
What about models smaller than 7 billion parameters?
Small models can run fast on less expensive, more available GPUs, making them a cost-effective choice for simple use cases.
Until recently, LLMs smaller than 7 billion parameters haven’t been ready for production use, but new models are changing this. Phi 3 Mini by Microsoft, our recommended model, is now MIT licensed and offers strong performance with only 3.8 billion parameters, while StableLM by Stability AI comes in at 2.8 billion parameters. FLAN-T5 has sizes as small as 80M parameters, and the new Qwen family of LLMs also has a 1.8B size.
Small models are also great for highly specific use cases, like generating SQL code. NSQL 350M has one-twentieth as many parameters as Mistral 7B but does a good job at its one task: generating SQL code.
The best open source LLM
There’s no one best open source LLM, only the LLM that’s best for you. This selection depends on capabilities, features, price point, and license. New models are released every day, and it can feel overwhelming to keep up. But finding the right model for your use case is possible with a bit of guidance and experimentation.
Deploy the best open source LLM for your use case in just a couple of clicks:
The best overall open source LLM: Llama 3.1 70B Instruct
The best big LLM: Llama 3.1 405B Instruct
The best small LLM under 7 billion parameters: Phi 3 Mini
Another great open source LLM: Mixtral 8x7B
The best aligned chat LLM: Zephyr 7B
The best LLM for code generation: Code Llama
The best LLM for fine tuning: Llama 3.1