How OpenCode powers the AI coding agent developers rely on every day

Company overview

OpenCode is an open-source AI coding agent that helps developers write code in the terminal, IDE, or desktop. It has quickly become one of the fastest-growing tools in the developer AI ecosystem, with over 169k GitHub stars, 900 contributors, and 11k commits, and it is trusted by 10M developers every month. OpenCode offers a model-agnostic coding experience: users choose from a curated set of high-quality, high-performance models through the Zen gateway or through low-cost plans like OpenCode Go. Unlike tools that treat AI as an occasional assistant, OpenCode is built for developers who code with AI every day, so the models it provides have to be fast, reliable, and consistent.

Impact highlights

5x tokens per second (TPS) and higher stability versus closed-source providers
100+ TPS consistently across open-source models
Sub-second time to first token (TTFT) consistently across open-source models
Strongest provider for tool calling quality with no tradeoff in performance
33% blended cost reduction with KV cache-aware routing and cache token pricing

Challenge

When OpenCode launched Zen, its model gateway for coding agents, it needed inference infrastructure that could match the performance and quality bar users already expected from closed-source models. The OpenCode team tested several inference providers for open-source models and hit a number of roadblocks: 1) lackluster performance (some API providers topped out at only 15 TPS), 2) elevated tool calling error rates (up to 10%), 3) sluggish time to first token, and 4) no zero data retention or US hosting.

OpenCode isn't a "nice to have." It's how developers work. Slow and inconsistent inference breaks developers' workflow and productivity, and when developers can't work consistently, business outcomes slow down too. The fastest-moving companies are the ones that harness AI in a stable, productive way. Beyond a high-quality user experience, OpenCode's mission presented a structural economics challenge. Inference is a variable cost, but weighing the cost of each request isn't a natural part of a developer's workflow. Developers want to open their editor and code; a highly variable bill degrades that experience.

Making a coding agent cost-efficient is a challenging inference problem because of the kinds of requests coding agents serve. Agentic coding requests are often long and repetitive: the same context, code snippets, and files get re-sent with every request. OpenCode needed a provider offering cache token pricing, where repeated input tokens served from cache are billed at a discounted rate, to make running open-source models economical at scale.
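To make the economics concrete, here is a minimal sketch of how cache token pricing changes the blended cost of input tokens. The function names, prices, and hit rate are hypothetical placeholders chosen for illustration; they are not OpenCode's or Baseten's actual figures.

```python
def blended_input_cost(total_tokens: int, cache_hit_rate: float,
                       price_per_mtok: float, cached_price_per_mtok: float) -> float:
    """Dollar cost of input tokens when a fraction of them is served from cache.

    All prices are per million tokens. Every value passed in is an assumption
    made for this example, not a published rate.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = total_tokens * cache_hit_rate   # tokens billed at the discounted rate
    uncached = total_tokens - cached         # tokens billed at the full rate
    return (uncached * price_per_mtok + cached * cached_price_per_mtok) / 1_000_000


def savings_fraction(cache_hit_rate: float, price_per_mtok: float,
                     cached_price_per_mtok: float) -> float:
    """Fraction of input cost saved relative to paying full price for every token."""
    if price_per_mtok <= 0:
        raise ValueError("price_per_mtok must be positive")
    full = blended_input_cost(1_000_000, 0.0, price_per_mtok, cached_price_per_mtok)
    blended = blended_input_cost(1_000_000, cache_hit_rate,
                                 price_per_mtok, cached_price_per_mtok)
    return 1.0 - blended / full


# Hypothetical scenario: $1.00 per million input tokens, cached tokens at half
# price, and 80% of input tokens hitting the cache.
example_savings = savings_fraction(0.8, 1.00, 0.50)
```

Under these assumed numbers, the savings scale with both the cache hit rate and the size of the cached-token discount, which is why long, repetitive agentic requests benefit most.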

"OpenCode isn’t something people use every once in a while - it's the default tool they use every day to do their work. It needs to feel the same every hour of every day. And inference is a huge part of creating that consistent experience, it has to be fast and reliable no matter when you’re sending requests."
Dax Raad, Co-founder

Solution

OpenCode utilized Baseten’s Model APIs, multi-tenant endpoints built for performance and reliability. Model APIs unlock the latest open-source models at the best available quality and performance, made possible by Baseten’s model performance team. The model performance team applies research techniques across speculative decoding, custom kernels, and proprietary inference runtimes to deliver the lowest possible latency and highest throughput. All of these techniques are part of Baseten’s model runtime.

A core part of Baseten's runtime is KV cache-aware routing, which directs each request to the replica already holding a user's previous context. This means repeated inputs aren’t recomputed from scratch on every call. Strong cache-aware routing yields the highest possible cache hit rate, which translates to better performance and lower cost (cache tokens are charged at a discounted rate).
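The routing idea can be illustrated with a toy example. The sketch below is not Baseten's implementation, which the source does not describe; it assumes a simple scheme in which requests that share a prompt prefix are pinned to the replica that first served that prefix, with new prefixes going to the least-loaded replica. The class name, `prefix_chars` parameter, and replica names are all hypothetical.

```python
import hashlib


class PrefixAffinityRouter:
    """Toy router: requests sharing a prompt prefix go to the same replica."""

    def __init__(self, replicas: list[str], prefix_chars: int = 256):
        if not replicas:
            raise ValueError("need at least one replica")
        self.replicas = list(replicas)
        self.prefix_chars = prefix_chars
        # prefix key -> replica assumed to hold the KV cache for that prefix
        self.affinity: dict[str, str] = {}
        # number of requests routed to each replica so far
        self.load: dict[str, int] = {r: 0 for r in self.replicas}

    def _key(self, prompt: str) -> str:
        # Hash only the leading characters, where shared context usually sits.
        prefix = prompt[: self.prefix_chars]
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def route(self, prompt: str) -> tuple[str, bool]:
        """Return (replica, cache_hit) for this prompt."""
        key = self._key(prompt)
        replica = self.affinity.get(key)
        hit = replica is not None
        if replica is None:
            # Unseen prefix: pick the least-loaded replica and remember it.
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.affinity[key] = replica
        self.load[replica] += 1
        return replica, hit


router = PrefixAffinityRouter(["replica-a", "replica-b"], prefix_chars=8)
first = router.route("SYSTEM: refactor utils.py")    # new prefix, cache miss
second = router.route("SYSTEM: now add tests")       # same 8-char prefix, cache hit
third = router.route("CONTEXT: unrelated session")   # new prefix, least-loaded replica
```

A production router would also have to handle cache eviction, replica failure, and load imbalance when one prefix becomes very popular; this sketch ignores all three.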

Lastly, tool calling quality can vary significantly among inference providers, because each provider implements different structured output protocols and parsing templates for each model. Baseten applies a number of techniques to ensure high completion rates for tool calls in both single-turn and multi-turn tasks. OpenCode's internal benchmarks validated Baseten's tool calling quality and surfaced actionable feedback that the Baseten team used to improve it further.

"Ever since the launch of Zen, we've gotten a lot of crazy feedback — it's fast and very good, like 90% as good as SOTA closed-source models, but it's just so fast that it changes people's behavior when they code."
Frank Wang, CTO

Result

Inference was so fast with Baseten that OpenCode added checkpoints: places where developers can pause and review what the agent has done before continuing.

High cache hit rates and cache token pricing enabled OpenCode to achieve a 33% blended cost reduction versus its previous providers. OpenCode passed those savings directly to its users in the form of discounted usage for Zen.

OpenCode has grown from roughly 40k monthly active users (MAUs) at the start of the partnership to 10M MAUs, with Baseten serving as one of its core inference providers across multiple models amid massive user growth.