The best open-source large language models (LLMs)

New LLMs drop every day, and with so many choices it can be hard to select the right one. Your tasks might call for coding, agentic workflows, or long-context reasoning. Whatever your use case, we want to help you find the best open-source LLM for it.

In this post, we review the technical specifications, benchmarks, and first-hand data from models we currently run for customers in production at Baseten.

DeepSeek V4 Pro

DeepSeek V4 Pro is a frontier open-source model built for agentic coding and complex STEM reasoning. Its architecture allows for a 1M-token context window through a hybrid attention mechanism:

CSA (Compressed Sparse Attention) shrinks the KV cache: instead of keeping one entry for every token, it summarizes groups of tokens into a smaller set of compressed entries. (Keys help the model decide which tokens to pay attention to, and values carry the information that gets mixed into a token's representation from the tokens it attends to. Together they are cached as the “KV cache”.) It then attends sparsely over that compressed sequence: each token picks out only a handful of the most relevant compressed blocks instead of reading everything, so it keeps detail where it matters while cutting memory.

HCA (Heavily Compressed Attention) compresses much more aggressively, squeezing large spans of tokens into a single entry, but drops the sparse selection. Because the compressed sequence is now so short, the model can afford to attend densely across all of it. Together, CSA and HCA reduce KV cache memory to roughly 10% of what standard models require.
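To make the CSA idea concrete, here is a minimal Python sketch. It assumes mean pooling as the compression step and a dot-product score for block selection; both are illustrative stand-ins, since the real model learns its compression. The names `compress_blocks` and `top_k_blocks` are hypothetical.

```python
def compress_blocks(vectors, block_size):
    """Mean-pool consecutive groups of vectors into one entry per block.

    Assumption: mean pooling stands in for the model's learned compression.
    """
    blocks = []
    for start in range(0, len(vectors), block_size):
        group = vectors[start:start + block_size]
        dim = len(group[0])
        blocks.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return blocks


def top_k_blocks(query, compressed_keys, k):
    """Return the indices of the k compressed blocks with the highest dot-product score."""
    scores = [sum(q * c for q, c in zip(query, key)) for key in compressed_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# 8 token keys in 2 dimensions, compressed 4:1 into 2 cache entries.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
compressed = compress_blocks(keys, block_size=4)

# The query attends only to the single most relevant compressed block.
selected = top_k_blocks([0.0, 1.0], compressed, k=1)
```

With 4:1 compression the cache holds 2 entries instead of 8, and the query reads only the one block that matches it. Dropping the `top_k_blocks` step and attending to every compressed entry corresponds to the dense variant described for HCA.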

Manifold-Constrained Hyper-Connections (mHC) keep training stable at scale, allowing DeepSeek to reliably train models with massive context windows by replacing the standard residual addition between layers with a mathematically constrained mixing operation.
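One way to picture a constrained mixing operation is a mixing matrix whose rows and columns each sum to 1 (a doubly stochastic matrix), so mixing several residual streams redistributes signal without amplifying it. The sketch below assumes that constraint and enforces it with Sinkhorn normalization. This post does not spell out the exact constraint DeepSeek uses, so treat this as an illustration; `sinkhorn` and `mix_streams` are hypothetical names.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately normalize rows and columns of a positive matrix.

    Assumption: a doubly stochastic matrix is the constraint being illustrated.
    """
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        m = [[x / sum(row) for x in row] for row in m]  # each row sums to 1
        col_sums = [sum(row[j] for row in m) for j in range(len(m[0]))]
        m = [[row[j] / col_sums[j] for j in range(len(row))] for row in m]  # each column sums to 1
    return m


def mix_streams(streams, mixing):
    """Each output stream is a weighted combination of the input streams."""
    dim = len(streams[0])
    return [
        [sum(mixing[i][j] * streams[j][d] for j in range(len(streams))) for d in range(dim)]
        for i in range(len(streams))
    ]


mixing = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
streams = [[1.0, 2.0], [3.0, 4.0]]
mixed = mix_streams(streams, mixing)
```

Because every column of `mixing` sums to 1, the total signal across streams is the same before and after mixing, which is the kind of property that keeps deep stacks from blowing up or vanishing.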

What we love about DeepSeek V4 Pro:

DeepSeek V4 Pro leads open-source models on agentic coding and matches closed-source frontier models
1M token context window to feed it an entire codebase or document library in a single call
CSA + HCA reduce KV cache memory to ~10% of standard transformer requirements, which lowers inference costs and allows more requests to run on the same GPU
mHC keeps training stable at scale, enabling reliable performance across massive contexts
Significantly cheaper than comparable closed-source models on mid-to-high complexity tasks

You can try DeepSeek V4 Pro on Baseten.

Gemma 4

Google DeepMind’s open-weight Gemma model is built for enterprise fine-tuning and multimodal reasoning.

Gemma 4 alternates between sliding window attention, where each token only looks at a fixed window of neighboring tokens, and sparsely placed global attention layers, where every token attends to every other token in the full sequence. This cuts compute overhead sharply while preserving long-range reasoning.
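The compute saving is easy to see by counting how many (query, key) pairs each layer type scores. This is a rough sketch that assumes causal masking and a hypothetical window of 128 tokens over a 1,024-token sequence; neither number comes from Gemma 4's configuration.

```python
def attended_pairs(seq_len, window=None):
    """Count (query, key) pairs under causal attention.

    window=None means global attention; otherwise each query sees at most
    `window` positions, itself included.
    """
    total = 0
    for query in range(seq_len):
        first = 0 if window is None else max(0, query - window + 1)
        total += query - first + 1
    return total


global_pairs = attended_pairs(1024)             # 524800
local_pairs = attended_pairs(1024, window=128)  # 122944
```

Global attention grows quadratically with sequence length while the sliding window grows linearly, so the gap widens quickly at longer contexts. Keeping a few global layers is what lets information still travel across the full sequence.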

Context windows go up to 128K on edge deployments and 256K in the cloud.

There are two notable Gemma 4 models:

Gemma 4 12B is encoder-free, with native audio support. The model is best for local deployment and agentic workflows on consumer hardware.
Gemma 4 31B is a more capable dense model, better for high-quality outputs where hardware isn’t a constraint. It does not have native audio and uses more VRAM.

What we love about Gemma 4:

Great for agentic apps and fine-tuning custom models
Interleaved local/global attention reduces memory and compute costs
Up to 256K context window in cloud deployments
Strong multimodal reasoning and agentic workflow performance

Try Gemma 4 on Baseten.

GLM-5.2

GLM-5.2 from Zhipu AI (commonly called Z.ai) is built for long-horizon coding tasks that would require an agent to run for hours. It uses a Mixture of Experts (MoE) architecture with 256 experts, routing just 8 experts per token, so you get the reasoning capacity of a massive model at a lower compute cost.
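Top-k routing can be sketched in a few lines. The example below uses the 256-expert, 8-active setup described above with random router scores; the softmax over the selected experts is a common choice that we assume here, not a confirmed detail of GLM-5.2, and `route_token` is a hypothetical name.

```python
import math
import random


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and softmax-normalize their weights."""
    chosen = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for a learned router's output
weights = route_token(logits, top_k=8)
```

Only the 8 experts in `weights` run for this token; the other 248 are skipped, which is where the compute saving comes from.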

GLM’s sparse attention compresses all Key and Value matrices per layer into a single small latent matrix, reconstructing the full matrices on the fly. This reduces the KV cache size by roughly the number of attention heads.
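A quick back-of-the-envelope calculation shows where a saving on the order of the head count comes from. The head count and dimensions below are hypothetical, and the sketch assumes the latent is about the size of one head's key plus value.

```python
def kv_values_per_token(n_heads, head_dim):
    """Standard cache: one key and one value vector per head, per token, per layer."""
    return 2 * n_heads * head_dim


def latent_values_per_token(latent_dim):
    """Latent cache: a single compressed vector per token, per layer."""
    return latent_dim


n_heads, head_dim = 64, 128  # hypothetical model shape
latent_dim = 2 * head_dim    # assumption: latent roughly the size of one head's K and V

ratio = kv_values_per_token(n_heads, head_dim) / latent_values_per_token(latent_dim)
```

Under these assumptions `ratio` equals the number of heads (64). The trade-off is extra compute to reconstruct full keys and values from the latent at attention time.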

To reduce compute cost, GLM-5.2 adds IndexShare. Transformer layers are organized into groups of four: the first layer in each group runs the indexer and picks the top-k token indices to attend to, and layers 2–4 reuse those same indices instead of recomputing them at every layer.
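The index-sharing pattern can be sketched as follows. This is a toy model of the control flow only: `scores_for_layer` is a hypothetical stand-in for the real indexer, and the group size of four follows the description above.

```python
def select_top_k(scores, k):
    """Return the positions of the k highest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(num_layers, group_size, scores_for_layer, k):
    """Run the indexer once per group and reuse its indices for the rest of the group."""
    indices_per_layer = []
    indexer_calls = 0
    shared = None
    for layer in range(num_layers):
        if layer % group_size == 0:  # first layer of a group: recompute indices
            shared = select_top_k(scores_for_layer(layer), k)
            indexer_calls += 1
        indices_per_layer.append(shared)  # other layers reuse the group's indices
    return indices_per_layer, indexer_calls


def toy_scores(layer):
    """Hypothetical relevance scores for 10 cached tokens."""
    return [((layer + 1) * (token + 3)) % 7 for token in range(10)]


indices, calls = run_layers(num_layers=8, group_size=4, scores_for_layer=toy_scores, k=3)
```

Eight layers trigger only two indexer calls here instead of eight, so the indexing cost drops by the group size.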

GLM-5.2 actively manages its own working memory rather than just accumulating context, letting it iterate and self-correct through runs spanning hours to tens of hours.

What we love about GLM-5.2:

Stronger multi-hour agentic coding workflows than GLM-5.1
Dynamic working memory enables up to tens of hours of autonomous execution without context overload, making it well-suited for coding agents
1M context window via sparse attention and IndexShare

Experiment with GLM-5.2 on Baseten.

GPT OSS 120B

OpenAI's open-weight reasoning model, GPT OSS, is optimized for text generation and conversational AI. It’s cost-efficient and hits 650+ tokens/second on Baseten with NVIDIA GPUs, making it one of the fastest and most affordable 120B models available.

Baseten reached 650+ TPS on GPT OSS 120B by using TensorRT-LLM on B200 GPUs and NVIDIA Dynamo (for KV cache-aware routing) at launch, then added EAGLE-3 speculative decoding, in which a smaller ~1B draft model proposes several tokens ahead for the main model to verify in parallel, delivering a 60% speed boost.
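The draft-and-verify loop behind speculative decoding looks roughly like this. The sketch shows the generic greedy version with toy "models" that are plain Python functions; it is not EAGLE-3 itself, and `speculative_step`, `target_next`, and `draft_next` are hypothetical names.

```python
def speculative_step(context, draft_next, target_next, num_draft):
    """Draft proposes num_draft tokens; the target keeps the agreeing prefix plus one token of its own."""
    proposed = []
    draft_context = list(context)
    for _ in range(num_draft):
        token = draft_next(draft_context)
        proposed.append(token)
        draft_context.append(token)

    accepted = []
    verify_context = list(context)
    for token in proposed:
        # A real system scores all proposed positions in one batched target pass.
        expected = target_next(verify_context)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(token)
        verify_context.append(token)
    accepted.append(target_next(verify_context))  # bonus token when every draft matched
    return accepted


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except after the token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens = speculative_step([0], draft_next, target_next, num_draft=4)  # [1, 2, 3, 4]
```

The output matches what the target would have produced alone, but it took one verification pass instead of four sequential target steps. The speedup depends on how often the draft agrees with the target.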

What we love about GPT OSS 120B:

650+ tokens/second on Baseten; one of the fastest 120B models available
Fits in ~80GB of VRAM (a single 80GB data-center GPU), making deployment straightforward
One of the cheapest 120B models to run without sacrificing generation quality

You can deploy GPT OSS 120B on Baseten.

Kimi K2.7 Code

Notable improvements in Kimi K2.7 Code over Kimi K2.6 include stronger coding and agent performance on long-horizon, repo-level tasks.

While Kimi K2.6 is already highly reliable for day-to-day programming, K2.7 is tuned for multi-step software engineering workflows: planning, editing, debugging, and iterating across many steps.

K2.7 Code is trained on 15.5T tokens, with code-focused training that improves coding performance while using fewer reasoning tokens. Lower thinking-token usage makes K2.7 Code more efficient over long agent sessions, using ~30% fewer tokens per coding task.

Unlike Kimi K2.6, which lets you toggle between thinking and non-thinking modes, K2.7 always runs in thinking mode for deeper reasoning.

What we love about Kimi K2.7 Code:

Built for multi-step software engineering (plan → edit → debug across many steps)
Cheaper and faster coding workflows: uses fewer tokens and less unnecessary reasoning, which lowers inference cost
More reliable agentic execution

Try Kimi K2.7 Code on Baseten.

Kimi K3

Kimi K3 is Moonshot AI's 2.8-trillion-parameter model, built for long-horizon coding sessions that run autonomously for hours. The model is rewriting the stack it runs on: an early K3 handled most of the team’s kernel optimization work during development. K3 has native vision, a 1M-token context window, and uses Stable LatentMoE, activating just 16 of 896 experts per token.

AttnRes (Attention Residuals): To generate a token, the input passes through a stack of layers. Usually, the outputs of these layers are summed with equal weight to determine which token to generate. AttnRes instead multiplies each term of that sum by its own weight, letting the model give more importance to whichever layers are most useful in context instead of treating them all equally.
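The difference between a plain sum and a weighted sum over layer outputs can be sketched as below. How K3 computes the per-layer weights is not covered here, so the sketch takes the scores as an input and assumes a softmax turns them into weights; `plain_residual` and `weighted_residual` are hypothetical names.

```python
import math


def plain_residual(layer_outputs):
    """Standard residual stream: every layer's output counts equally."""
    dim = len(layer_outputs[0])
    return [sum(out[d] for out in layer_outputs) for d in range(dim)]


def weighted_residual(layer_outputs, layer_scores):
    """Weight each layer's output before summing (softmax over scores is an assumption)."""
    peak = max(layer_scores)
    exps = [math.exp(s - peak) for s in layer_scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(layer_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, layer_outputs)) for d in range(dim)]


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
equal = plain_residual(outputs)                           # [2.0, 2.0]
focused = weighted_residual(outputs, [0.0, 0.0, 50.0])    # dominated by the third layer
```

With a high score on the third layer, `focused` is almost exactly that layer's output, which is the "give more importance to the most useful layers" behavior in miniature.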

KDA + MLA (Kimi Delta Attention + Multi-head Latent Attention) tackle the long-context problem: the KV cache grows with every token, so generation slows as context gets longer. KDA replaces the growing cache with a fixed-size memory that updates as tokens arrive. A fixed-size memory can't recall everything as precisely as a full KV cache, so Kimi K3 interleaves MLA layers, which preserve full recall efficiently by compressing each token's KV into a small latent vector. This keeps compute low and recall accurate as context grows.

What we love about Kimi K3:
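A fixed-size memory of this kind can be illustrated with a plain delta-rule update, where each new key/value pair corrects what the memory currently returns for that key. This is a simplified sketch under our own assumptions: it omits the gating and decay a production layer would use, and `delta_update` and `read` are hypothetical names.

```python
def read(state, key):
    """Query the memory: multiply the state matrix by the key."""
    return [sum(row[j] * key[j] for j in range(len(key))) for row in state]


def delta_update(state, key, value, beta=1.0):
    """Write `value` under `key` by correcting the memory's current prediction.

    The state keeps the same shape no matter how many tokens are written.
    """
    predicted = read(state, key)
    return [
        [row[j] + beta * (value[i] - predicted[i]) * key[j] for j in range(len(key))]
        for i, row in enumerate(state)
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
k1, v1 = [1.0, 0.0], [3.0, 4.0]
k2, v2 = [0.0, 1.0], [5.0, 6.0]
state = delta_update(state, k1, v1)
state = delta_update(state, k2, v2)
recalled = read(state, k1)  # [3.0, 4.0]: orthogonal keys do not interfere

# A third key that overlaps both earlier keys overwrites part of what they stored.
crowded = delta_update(state, [0.6, 0.8], [0.0, 0.0])
degraded = read(crowded, k1)  # no longer [3.0, 4.0]
```

The memory stays 2x2 after three writes, but recall of the first pair degrades once keys start to overlap. That loss is the gap the interleaved MLA layers are there to cover.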

Long-horizon agentic coding: sustained multi-hour autonomous sessions
Native multimodality with vision-in-the-loop: iterates between code and live screenshots for frontend, game dev, and CAD; strong 3D reasoning
Strong visual output: turns research into interactive dashboards and frontend UI
Great for creative writing: one of the best models for writing, with a style that stands out from other LLMs

Try Kimi K3 on Baseten.

MiniMax M3

MiniMax M3 is a strong choice for frontend and UI work, visual reasoning, and creative tasks. It supports a 1M token context window using MiniMax Sparse Attention (MSA), which keeps memory costs manageable at scale. It produces clean results on design-adjacent tasks: UI generation, code review with visual context, creative writing.

What we love about MiniMax M3:

1M token context window via sparse attention, without the memory costs of standard transformers
Stands out for frontend/UI generation and visual reasoning tasks
Strong on creative and design-heavy workflows

Try MiniMax M3 on Baseten.

Nemotron 3 Ultra

NVIDIA’s 550B model is a frontier open-weight mixture of experts (MoE) language model designed for long-running agents. You get the knowledge capacity of a 550B model at the cost of running a 55B one. Its hybrid Mamba-Transformer architecture keeps inference time roughly flat as context grows. On long-running agentic workflows, that translates to up to 5x faster inference and up to 30% lower cost.

Ultra is one of three models in NVIDIA's Nemotron 3 family, alongside Nano and Super. The three are designed to be complementary.

What we love about Nemotron 3 Ultra:

1M-token context with hybrid Mamba-Transformer: step time stays flat as context scales
Built for long-running agents and long-context enterprise workloads
Sets a thinking token limit per step to avoid over-reasoning on simple tasks
Affordable for a 1M token context window model

Try Nemotron 3 Ultra on Baseten.

Qwen 3.6

The Qwen 3.6 open-source family includes two models: the 27B dense and the 35B-A3B MoE (3B active params). For agentic coding specifically, Qwen 3.6 provides repo-level reasoning and strong frontend workflow performance.

Notable improvements over Qwen 3.5 include native multimodal support across both models, each handling images and video in a single checkpoint. On the coding side, Qwen 3.5 handled general programming but struggled with complex, repo-level tasks. The Qwen 3.6 27B model outperforms the previous 397B open-source flagship on every major coding benchmark, despite being a fraction of the size.

What we love about Qwen 3.6:

Runs well on consumer hardware across the family
Top-tier agentic coding: repo-level reasoning with minimal mid-task corrections
Reliable performance across frontend and devops workflows
Natively multimodal (text, image, video) in both open-source models

Choose a model from the Qwen 3.6 family: 27B, 35B-A3B

When should you trust benchmarks?

Benchmarks are a starting point. A model that tops a leaderboard may underperform on your specific task, so run evaluations on your own workload to find the model that works for you. Once you've chosen a model, you can optimize inference performance metrics like TTFT, TPS, and end-to-end latency.
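If you log when each streamed token arrives, these three metrics fall out of simple arithmetic. The sketch below assumes timestamps in seconds and measures TPS over the decode phase only (after the first token); definitions vary between tools, and `latency_metrics` is a hypothetical helper.

```python
def latency_metrics(request_sent, token_times):
    """Compute TTFT, decode-phase TPS, and end-to-end latency from arrival timestamps."""
    ttft = token_times[0] - request_sent
    end_to_end = token_times[-1] - request_sent
    decode_time = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent producing them.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "tps": tps, "end_to_end_s": end_to_end}


metrics = latency_metrics(request_sent=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5])
# {'ttft_s': 0.5, 'tps': 2.0, 'end_to_end_s': 2.5}
```

Measure these on your own prompts and output lengths, since TTFT scales with input size and TPS with batch load.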

The best open-source LLM

The best open-source LLM depends on your workload and what you’re optimizing for. All models mentioned here are used in production today for many AI applications.

FAQs

What is a context window?

The maximum amount of text a model can process at once: your prompt, conversation history, documents, and output combined. A 1M token window means you can usually include entire codebases or document libraries in a single call.

What is Mixture of Experts (MoE)?

An architecture where the model is divided into many specialized sub-networks, but only a small subset activates for any given input. You get the reasoning capacity of a much larger model at a fraction of the inference cost. MoE doesn't use less memory: all the parameter weights need to be stored in VRAM.

MoE predates today's LLMs, but open models such as Mixtral and DeepSeek R1 popularized it, and it is now the typical architecture for large LLMs.
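The memory-versus-compute split is worth working through once. The sketch below uses the 35B-total, 3B-active figures from Qwen 3.6 35B-A3B and assumes 16-bit weights (2 bytes per parameter); quantized deployments would need less memory.

```python
def weight_memory_gb(total_params_billions, bytes_per_param):
    """All experts must be resident: billions of params x bytes per param = GB."""
    return total_params_billions * bytes_per_param


def active_fraction(total_params_billions, active_params_billions):
    """Share of the weights that actually run for each token."""
    return active_params_billions / total_params_billions


memory_gb = weight_memory_gb(35, bytes_per_param=2)  # 70 GB of weights at 16-bit (assumption)
fraction = active_fraction(35, 3)                    # about 0.086
```

So each token touches under 9% of the weights, but the full 70 GB still has to be stored, which is why MoE lowers compute cost much more than it lowers memory requirements.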

What is a dense model?

A dense model means all parameters are active for every token/input during the forward pass, as opposed to a Mixture of Experts (MoE) model.

What does “open-weight” mean? Is it the same as open-source?

Open-weight means the model’s trained weights are public. You can then run inference and post-training yourself or on an infrastructure platform like Baseten.

Open-source means the training code, data, and methodology are also public. Most models here are open-weight but not fully open-source.

Which open-source LLMs can run locally on consumer hardware?

Qwen 3.6 and Gemma 4 12B are the best options for running LLMs locally; both run well on consumer hardware. GPT OSS 120B needs ~80GB of VRAM, which is beyond most consumer GPUs, but it is one of the cheapest 120B models to run.

What is KV cache-aware routing?

KV cache-aware routing sends requests to the GPU worker that already has the relevant cache loaded, cutting time-to-first-token.
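A minimal version of that routing decision picks the worker whose cache shares the longest token prefix with the incoming request. This is a sketch of the idea only: production routers also weigh load and cache eviction, and the worker names and `pick_worker` function are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_worker(request_tokens, worker_caches):
    """Choose the worker with the longest cached prefix for this request.

    worker_caches maps a worker name to the token sequences it has cached.
    """
    best_worker, best_len = None, -1
    for worker, cached in worker_caches.items():
        longest = max((shared_prefix_len(request_tokens, seq) for seq in cached), default=0)
        if longest > best_len:
            best_worker, best_len = worker, longest
    return best_worker, best_len


caches = {"gpu-0": [[1, 2, 3, 4]], "gpu-1": [[1, 2, 9]], "gpu-2": []}
choice = pick_worker([1, 2, 3, 7], caches)  # ('gpu-0', 3)
```

Here `gpu-0` can skip recomputing the first three tokens of the prompt, so only the new suffix needs prefill before the first output token.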
