We evaluated eight of the best open-source models: DeepSeek V4 Pro, Gemma 4, GLM 5.1, GPT OSS 120B, Kimi K2.6, MiniMax M3, Nemotron 3 Ultra, and Qwen 3.6. Kimi K2.6 is the most well-rounded; Qwen 3.6 and GLM 5.1 lead for agentic coding; DeepSeek and Nemotron dominate long-context and enterprise workloads; and GPT OSS 120B performs well on cost and speed.
New LLMs ship constantly, and with so many choices it can be hard to pick the right one. Your tasks might call for coding, agentic workflows, or long-context reasoning. Whatever your use case, we want to help you find the best open-source LLM for it.
In this post, we’ll review the technical specifications, benchmarks, and first-hand data from models we currently run for customers in production at Baseten.
DeepSeek V4 Pro
DeepSeek V4 Pro is a frontier open-source model built for agentic coding and complex STEM reasoning. Its architecture allows for a 1M-token context window through a hybrid attention mechanism:
CSA (Cross-layer Shared Attention) reuses the Key and Value matrices from the first layer across all subsequent layers. (Keys help the model decide which words to pay attention to, and Values determine what information from those relevant words gets added to a word's meaning. Together they are cached as the “KV cache”.) Key and Value matrices don’t change much between layers, so sharing them cuts memory significantly with only a minor quality tradeoff.
HCA (Hierarchical Chunked Attention) splits the sequence into chunks: each token attends precisely to others in its own chunk, and coarsely to summaries of distant chunks. Close tokens get specific attention; far tokens get averaged context. Together, CSA and HCA reduce KV cache memory to roughly 2% of what standard models require.
Manifold-Constrained Hyper-Connections (mHC) keep training stable at scale, allowing DeepSeek to reliably train models with massive context windows by replacing the standard residual addition between layers with a mathematically constrained mixing operation.
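To make the two attention ideas above concrete, here is a minimal Python sketch. It is not DeepSeek's implementation: the function names, chunk size, and model dimensions are hypothetical and chosen only for illustration, and causal masking is left out for brevity. The first function estimates KV cache size when every layer keeps its own Keys and Values versus when a single layer's are shared; the second lists what one token attends to under a chunked scheme.

```python
def kv_cache_bytes(seq_len, layers_with_own_kv, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    # The factor of 2 covers the Key matrix plus the Value matrix.
    return 2 * seq_len * layers_with_own_kv * kv_heads * head_dim * bytes_per_value


def chunked_attention_targets(position, seq_len, chunk_size):
    """Return (exact token positions, summarized chunk indices) for one token."""
    own_chunk = position // chunk_size
    start = own_chunk * chunk_size
    end = min(start + chunk_size, seq_len)
    exact_tokens = list(range(start, end))      # precise attention inside the chunk
    num_chunks = -(-seq_len // chunk_size)      # ceiling division
    summarized_chunks = [c for c in range(num_chunks) if c != own_chunk]
    return exact_tokens, summarized_chunks


# Hypothetical dimensions, not DeepSeek V4 Pro's real configuration.
SEQ_LEN, LAYERS, KV_HEADS, HEAD_DIM = 1_000_000, 60, 8, 128

per_layer = kv_cache_bytes(SEQ_LEN, LAYERS, KV_HEADS, HEAD_DIM)
shared = kv_cache_bytes(SEQ_LEN, 1, KV_HEADS, HEAD_DIM)
print(f"per-layer KV cache: {per_layer / 1e9:.1f} GB")
print(f"shared KV cache:    {shared / 1e9:.1f} GB")

tokens, chunks = chunked_attention_targets(position=5_000, seq_len=SEQ_LEN, chunk_size=1_000)
print(f"attends to {len(tokens)} exact tokens and {len(chunks)} chunk summaries")
```

With sharing, the cache shrinks by the number of layers that no longer store their own Keys and Values. With chunking, one token looks at far fewer entries than the full sequence length.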
What we love about DeepSeek V4 Pro:
DeepSeek V4 Pro leads open-source models on agentic coding and matches closed-source frontier models
1M token context window to feed it an entire codebase or document library in a single call
CSA + HCA reduce KV cache memory to ~2% of standard transformer requirements, which lowers inference costs and lets more requests run on the same GPU
mHC keeps training stable at scale, enabling reliable performance across massive contexts
Significantly cheaper than comparable closed-source models on mid-to-high complexity tasks
You can try DeepSeek V4 Pro on Baseten.
Gemma 4
Google DeepMind’s open-weight Gemma model is built for enterprise fine-tuning and multimodal reasoning.
Gemma 4 alternates between sliding window attention, where each token only looks at a fixed window of neighboring tokens, and sparsely placed global attention layers, where every token attends to every other token in the full sequence. This cuts compute overhead sharply while preserving long-range reasoning.
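The difference between the two layer types is easiest to see as attention masks. The sketch below is illustrative only: the window size, the ratio of local to global layers, and the function names are assumptions, not Gemma 4's actual configuration, and both masks are assumed to be causal.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (causal, local)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def global_mask(seq_len):
    """Causal full attention: token i may attend to every token up to itself."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def layer_schedule(num_layers, global_every):
    """Place one global layer after every (global_every - 1) local layers."""
    return ["global" if (layer + 1) % global_every == 0 else "local"
            for layer in range(num_layers)]


def attended_pairs(mask):
    """Count the (query, key) pairs a mask allows, a proxy for compute."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 16, 4  # hypothetical toy sizes
local = sliding_window_mask(SEQ_LEN, WINDOW)
full = global_mask(SEQ_LEN)
print(layer_schedule(num_layers=12, global_every=6))
print(attended_pairs(local), "local pairs vs", attended_pairs(full), "global pairs")
```

Local layers grow linearly with sequence length while global layers grow quadratically, so placing global layers sparsely keeps most of the stack cheap.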
Context windows go up to 128K on edge deployments and 256K in the cloud.
There are two notable Gemma 4 models:
Gemma 4 12B is encoder-free, with native audio support. The model is best for local deployment and agentic workflows on consumer hardware.
Gemma 4 31B is a more capable dense model, better for high-quality outputs where hardware isn’t a constraint. It does not have native audio and uses more VRAM.
What we love about Gemma 4:
Great for agentic apps and fine-tuning custom models
Interleaved local/global attention reduces memory and compute costs
Up to 256K context window in cloud deployments
Strong multimodal reasoning and agentic workflow performance
GLM 5.1
GLM 5.1 from Zhipu AI (commonly called Z.ai) is built for long-horizon coding tasks that would require an agent to run for hours. It uses a Mixture of Experts (MoE) architecture with 256 experts, routing just 8 experts per token, so you get the reasoning capacity of a massive model at a lower compute cost.
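As a rough illustration of the routing step, here is a minimal top-k router in Python. It uses the expert counts quoted above, but everything else is an assumption: the random logits stand in for a learned router layer, and `route_token` is a hypothetical helper, not GLM's code.

```python
import math
import random


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return {expert_id: weight}."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: value / total for e, value in exps.items()}


NUM_EXPERTS, TOP_K = 256, 8  # expert counts quoted above for GLM 5.1

rng = random.Random(0)
# Stand-in for the router's output; a real router is a learned linear layer.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_token(logits, TOP_K)

print("experts used:", sorted(weights))
print(f"fraction of experts active per token: {TOP_K / NUM_EXPERTS:.1%}")
```

Only the chosen experts run for that token, and their outputs are combined using the returned weights.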
GLM’s sparse attention compresses all Key and Value matrices per layer into a single small latent matrix, reconstructing the full matrices on the fly. This shrinks the KV cache by a factor roughly equal to the number of attention heads.
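The size claim can be checked with simple arithmetic. The sketch below assumes a latent vector about as large as one head's Key plus Value; the head count, head dimension, and latent size are hypothetical, not GLM 5.1's published values.

```python
def per_token_cache_values(num_heads, head_dim, latent_dim=None):
    """Number of values cached per token, per layer."""
    if latent_dim is None:
        # Standard multi-head attention: one Key and one Value vector per head.
        return 2 * num_heads * head_dim
    # Latent scheme: one compressed vector stands in for every head's K and V.
    return latent_dim


NUM_HEADS, HEAD_DIM = 64, 128   # hypothetical
LATENT_DIM = 2 * HEAD_DIM       # hypothetical: about one head's worth of K and V

standard = per_token_cache_values(NUM_HEADS, HEAD_DIM)
latent = per_token_cache_values(NUM_HEADS, HEAD_DIM, LATENT_DIM)
print(f"standard: {standard} values, latent: {latent} values, "
      f"ratio: {standard / latent:.0f}x")
```

Under these assumptions the reduction equals the head count. A larger latent vector would give a smaller reduction.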
More unusually, GLM 5.1 actively manages its own working memory rather than just accumulating context, letting it iterate and self-correct through runs of up to 8 hours. Baseten is one of the fastest providers for GLM 5.1, running more than 2× faster than most alternatives.
What we love about GLM 5.1:
Optimized for multi-hour agentic coding workflows
Dynamic working memory enables up to 8 hours of autonomous execution without context overload, making it well-suited for coding agents
200K context window via sparse KV cache compression
Experiment with GLM 5.1 on Baseten.
GPT OSS 120B
OpenAI's open-weight reasoning model, GPT OSS, is optimized for text generation and conversational AI. It’s cost-efficient and hits 650+ tokens/second on Baseten with NVIDIA GPUs, making it one of the fastest and most affordable 120B models available.
Baseten reached 650+ TPS on GPT OSS 120B by using TensorRT-LLM on B200 GPUs and NVIDIA Dynamo (for KV cache-aware routing) at launch, then added EAGLE-3 speculative decoding, which uses a smaller ~1B draft model to predict tokens in parallel and delivers a 60% speed boost.
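To show why a draft model speeds things up, here is a toy draft-and-verify loop in Python. It is a generic greedy speculative decoding sketch, not EAGLE-3 and not Baseten's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification that a real system performs in one batched forward pass is simulated here with a plain loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(context: List[int]) -> int:
    """Toy 'large model': a deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[int]) -> int:
    """Toy 'draft model': agrees with the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def speculative_decode(prompt: List[int], target: NextToken, draft: NextToken,
                       num_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal (one pass in a real system).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            if guess == correct:
                accepted.append(guess)
            else:
                accepted.append(correct)  # keep the target's token, drop the rest
                break
        else:
            # Every guess matched, so the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


output, passes = speculative_decode([1, 2, 3], target_next, draft_next, num_new=20)
print(f"generated 20 tokens with {passes} target passes")
```

The output is identical to what the target model would produce on its own. The gain comes from the target running fewer times than the number of tokens generated, and it depends on how often the draft guesses correctly.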
What we love about GPT OSS 120B:
650+ tokens/second on Baseten; one of the fastest 120B models available
Fits on a single 80GB GPU, making deployment straightforward
One of the cheapest 120B models to run without sacrificing generation quality
You can deploy GPT OSS 120B on Baseten.
Kimi K2.6
Moonshot AI's 1-trillion-parameter model is highly reliable for coding workloads. Built around the Kimi Code engine, it can handle large codebases and build interfaces directly from visual mockups.
Kimi K2.6 has multimodal support via MoonViT (a 400M-parameter visual encoder) and can take text, images, and video as input.
What we love about Kimi K2.6:
Reliable, high-quality coding across frontend, DevOps, and performance optimization
Generates code from visual specs
Native multimodal input: text, images, and video via MoonViT
Optimized for SWE workflows in Rust, Go, and Python
MiniMax M3
MiniMax M3 is a strong choice for frontend and UI work, visual reasoning, and creative tasks. It supports a 1M token context window using MiniMax Sparse Attention (MSA), which keeps memory costs manageable at scale. It produces clean results on design-adjacent tasks: UI generation, code review with visual context, creative writing.
What we love about MiniMax M3:
1M token context window via sparse attention, without the memory costs of standard transformers
Stands out for frontend/UI generation and visual reasoning tasks
Strong on creative and design-heavy workflows
Nemotron 3 Ultra
NVIDIA’s 550B model is a frontier open-weight mixture of experts (MoE) language model designed for long-running agents. You get the knowledge capacity of a 550B model at the cost of running a 55B one. Its hybrid Mamba-Transformer architecture keeps inference time roughly flat as context grows. On long-running agentic workflows, that translates to up to 5x faster inference and up to 30% lower cost.
Ultra is one of three models in NVIDIA's Nemotron 3 family, alongside Nano and Super. The three are designed to be complementary.
What we love about Nemotron 3 Ultra:
1M-token context with hybrid Mamba-Transformer: step time stays flat as context scales
Built for long-running agents and long-context enterprise workloads
Supports a per-step thinking-token limit to avoid over-reasoning on simple tasks
Affordable for a 1M token context window model
Try Nemotron 3 Ultra on Baseten.
Qwen 3.6
The Qwen 3.6 open-source family includes two models: the 27B dense and the 35B-A3B MoE (3B active params). For agentic coding specifically, Qwen 3.6 provides repo-level reasoning and strong frontend workflow performance.
Notable improvements over Qwen 3.5 include native multimodal support across both models, each handling images and video in a single checkpoint. On the coding side, Qwen 3.5 handled general programming but struggled with complex, repo-level tasks. The Qwen 3.6 27B model outperforms the previous 397B open-source flagship on every major coding benchmark, despite being a fraction of the size.
What we love about Qwen 3.6:
Runs well on consumer hardware across the family
Top-tier agentic coding: repo-level reasoning with minimal mid-task corrections
Reliable performance across frontend and DevOps workflows
Natively multimodal (text, image, video) in both open-source models
Choose a model from the Qwen 3.6 family: 27B, 35B-A3B
When should you trust benchmarks?
Benchmarks are a starting point. A model that tops a leaderboard may still underperform on your specific task, so run evaluations on your own workload to find the model that works for you. Once you've chosen a model, you can optimize inference performance metrics like time to first token (TTFT), tokens per second (TPS), and end-to-end latency.
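If you log when each streamed token arrives, all three metrics fall out of a few subtractions. The sketch below assumes you already have those timestamps; `StreamTiming` and the sample numbers are hypothetical, and TPS is measured here over the decode phase only (definitions vary between tools).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    request_sent: float          # seconds, when the request left the client
    token_arrivals: List[float]  # seconds, arrival time of each output token


def ttft(t: StreamTiming) -> float:
    """Time to first token."""
    return t.token_arrivals[0] - t.request_sent


def end_to_end_latency(t: StreamTiming) -> float:
    """Time from sending the request to receiving the last token."""
    return t.token_arrivals[-1] - t.request_sent


def tokens_per_second(t: StreamTiming) -> float:
    """Decode throughput after the first token; needs at least two tokens."""
    decode_time = t.token_arrivals[-1] - t.token_arrivals[0]
    return (len(t.token_arrivals) - 1) / decode_time


# Hypothetical timings: first token at 0.25 s, then one token every 10 ms.
timing = StreamTiming(request_sent=0.0,
                      token_arrivals=[0.25 + 0.01 * i for i in range(101)])
print(f"TTFT {ttft(timing):.3f} s, TPS {tokens_per_second(timing):.0f}, "
      f"end-to-end {end_to_end_latency(timing):.3f} s")
```

Tracking the three separately matters because they respond to different optimizations: TTFT is dominated by prompt processing, while TPS reflects decode speed.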
The best open-source LLM
The best open-source LLM depends on your workload and what you’re optimizing for. All models mentioned here are used in production today for many AI applications.
FAQs
What is a context window?
The maximum amount of text a model can process at once: your prompt, conversation history, documents, and output combined. A 1M token window means you can usually include entire codebases or document libraries in a single call.
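In practice this is a budget check before you send a request. The sketch below assumes you already have token counts from your model's tokenizer; the function names and the counts are hypothetical.

```python
def fits_in_context(window, prompt, history, documents, max_output):
    """True if all token counts together fit inside the context window."""
    return prompt + history + documents + max_output <= window


def remaining_tokens(window, prompt, history, documents, max_output):
    """Tokens left over (negative means the request is too large)."""
    return window - (prompt + history + documents + max_output)


WINDOW = 1_000_000  # a 1M-token context window
# Hypothetical token counts for one request.
request = dict(prompt=2_000, history=18_000, documents=900_000, max_output=16_000)

print(fits_in_context(WINDOW, **request))
print(remaining_tokens(WINDOW, **request))
```

Note that the output budget counts against the window too, so a prompt that exactly fills the window leaves no room for a response.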
What is Mixture of Experts (MoE)?
An architecture where the model is divided into many specialized sub-networks, but only a small subset activates for any given input. You get the reasoning capacity of a much larger model at a fraction of the inference cost. MoE doesn't use less memory: all the parameter weights need to be stored in VRAM.
MoE was popularized by open models such as Mixtral and DeepSeek R1 and is now the typical architecture for frontier LLMs.
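The split between memory and compute can be worked through with the Qwen 3.6 35B-A3B figures from this post (35B total parameters, 3B active). The 2 bytes per parameter is an assumption (16-bit weights); quantized weights would need less, and the helper names are hypothetical.

```python
def weight_memory_gb(total_params_billions, bytes_per_param=2):
    """Memory for weights alone; every expert must be resident in VRAM."""
    # (billions of params * 1e9) * bytes / (1e9 bytes per GB) simplifies to this.
    return total_params_billions * bytes_per_param


def active_fraction(total_params_billions, active_params_billions):
    """Share of the parameters that do work for any single token."""
    return active_params_billions / total_params_billions


# Qwen 3.6 35B-A3B: 35B total parameters, 3B active per token.
TOTAL_B, ACTIVE_B = 35, 3
print(f"weights at 2 bytes/param: {weight_memory_gb(TOTAL_B)} GB")
print(f"parameters used per token: {active_fraction(TOTAL_B, ACTIVE_B):.1%}")
```

Memory scales with the total parameter count, while per-token compute scales with the active count.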
What is a dense model?
In a dense model, all parameters are active for every input token during the forward pass, whereas a Mixture of Experts (MoE) model activates only a subset.
What does “open-weight” mean? Is it the same as open-source?
Open-weight means the model’s trained weights are public. You can run inference and post-training yourself or on an infrastructure platform like Baseten.
Open-source means the training code, data, and methodology are also public. Most models here are open-weight but not fully open-source.
Which open-source LLMs can run locally on consumer hardware?
Qwen 3.6 and Gemma 4 12B are the best options for running LLMs locally; both run well on consumer hardware. GPT OSS 120B needs ~80GB VRAM but is one of the cheapest 120B models available.
What is KV cache-aware routing?
KV cache-aware routing sends requests to the GPU worker that already has the relevant cache loaded, cutting time-to-first-token.
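A minimal way to picture this is a router that compares the incoming request's tokens with the prefixes each worker has cached. The sketch below is a simplification under stated assumptions: it is not NVIDIA Dynamo's algorithm, the worker names and token IDs are made up, and it ignores worker load, which production routers also weigh.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_worker(request_tokens, worker_caches):
    """Return (worker, overlap) for the worker with the longest cached prefix."""
    best_worker, best_overlap = None, -1
    for worker, cached_sequences in worker_caches.items():
        overlap = max((shared_prefix_len(request_tokens, seq)
                       for seq in cached_sequences), default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap


# Hypothetical token IDs and worker names.
caches = {
    "gpu-0": [[1, 2, 3, 4, 5, 6]],
    "gpu-1": [[1, 2, 9, 9]],
    "gpu-2": [],
}
worker, overlap = pick_worker([1, 2, 3, 4, 7, 8], caches)
print(worker, overlap)
```

The chosen worker can skip recomputing the overlapping tokens, which is where the time-to-first-token saving comes from.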