
At Baseten, we collaborate closely with NVIDIA to push the boundaries of model performance. When NVIDIA releases new tooling, our model performance team immediately starts testing it out, measuring the potential gains against our current stack and battle-hardening new features for production.
Often, NVIDIA releases updates as a result of this work: our engineers submit pull requests to NVIDIA's open-source GitHub repositories to make the tooling more robust and secure for production use cases. This collaboration is what led us to quickly adopt NVIDIA Dynamo, NVIDIA's newest open-source inference framework.
How Baseten uses NVIDIA Dynamo
NVIDIA Dynamo is built for large-scale LLM serving across distributed GPU clusters with high throughput and low latency. It includes features like disaggregated prefill and decode steps, KV cache-aware routing, KV cache offload to storage, an SLA-based planner for autoscaling, and dynamic GPU scheduling. Across all models, we've seen huge performance improvements from Dynamo's KV cache-aware routing; those benefits are the focus of this post.
The KV cache stores a model's previously computed key/value states for past tokens, so it can reuse them instead of recomputing them on each new request. This speeds up inference, especially for long-context generation. KV cache-aware routing (NVIDIA also calls this "LLM-aware request routing") makes sure that incoming requests are sent to the model replicas that already hold the relevant context in cache.
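To make that concrete, here is a minimal, framework-agnostic sketch of what KV cache reuse looks like inside a decoder. None of these names are Dynamo or backend APIs; the point is only that each decoding step appends one new key/value pair and reuses everything already cached, rather than re-projecting every past token.

```python
import numpy as np

D = 64  # head dimension (illustrative)

class KVCache:
    """Holds key/value vectors for every token processed so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend(query, cache):
    """Attention over cached keys/values; past K/V are reused, never recomputed."""
    K = np.stack(cache.keys)                 # (seq_len, D)
    V = np.stack(cache.values)               # (seq_len, D)
    scores = K @ query / np.sqrt(D)          # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                       # (D,)

# Toy decoding loop: each step projects and caches exactly one new K/V pair.
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
cache = KVCache()
for _ in range(5):
    x = rng.normal(size=D)                   # embedding of the newest token
    cache.append(x @ Wk, x @ Wv)
    out = attend(x @ Wq, cache)
```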
How KV-aware routing works
The NVIDIA Dynamo LLM Aware Router manages the KV cache across large GPU fleets in multinode and disaggregated inference deployments, and it supports different inference backends like SGLang, TensorRT-LLM, and vLLM. It hashes incoming requests and organizes them in a radix tree, which enables scalable tracking of cache locations in distributed environments.
When new inference requests arrive, the LLM Aware Router calculates an overlap score between the request and the KV cache blocks already active across all GPUs in the cluster. Based on this overlap and the current distribution of GPU workload, it routes requests to the most suitable workers, minimizing KV cache recomputation while maintaining balanced load across the cluster.
Unlike round-robin or purely load-based routing, this method improves overall system performance by considering cache hit rate and workload balance, ensuring efficient request handling and reducing resource bottlenecks.
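Below is a simplified sketch of that routing idea with hypothetical names; it is not Dynamo's actual implementation. Requests are split into fixed-size token blocks, each block hash chains in the previous one (a flat stand-in for the radix tree), and a worker is chosen by trading off prefix overlap against current load.

```python
from dataclasses import dataclass, field
from hashlib import blake2b

BLOCK_SIZE = 64  # tokens per KV block (illustrative)

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's hash,
    so equal hashes imply an identical prefix."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = blake2b(prev + repr(block).encode(), digest_size=8).digest()
        hashes.append(h)
        prev = h
    return hashes

@dataclass
class Worker:
    name: str
    active_load: int = 0                      # requests (or tokens) in flight
    cached_blocks: set = field(default_factory=set)

def route(token_ids, workers, overlap_weight=1.0, load_weight=0.1):
    """Pick the worker with the best trade-off between prefix overlap and load."""
    req_hashes = block_hashes(token_ids)

    def score(worker):
        overlap = 0
        for h in req_hashes:                  # count contiguous matching prefix blocks
            if h in worker.cached_blocks:
                overlap += 1
            else:
                break
        return overlap_weight * overlap - load_weight * worker.active_load

    return max(workers, key=score)

# Example: a warm worker wins despite slightly higher load.
prompt = list(range(256))
warm = Worker("gpu-0", active_load=3, cached_blocks=set(block_hashes(prompt)))
cold = Worker("gpu-1", active_load=1)
assert route(prompt, [warm, cold]) is warm
```

In a real deployment, the cached-block sets would be maintained from cache events reported by the workers; here they are plain in-memory sets to keep the sketch self-contained.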

For instance, if you use one of our Model APIs to chat with DeepSeek R1, new requests (new inputs to the model) will be routed to the replica with the best combination of KV cache overlap and GPU load. We love that Dynamo also lets us easily add custom routing logic, so we can use a mix of KV-aware and round-robin routing as best suits the model and use case.
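As an illustration of what such custom logic could look like (again with hypothetical names, not Dynamo's API), here is a sketch of a hybrid policy that prefers the KV-aware choice when a request overlaps with cached blocks and falls back to round-robin for cold requests. It assumes a `kv_router` callable that returns the best worker together with its overlap, in the spirit of the previous sketch.

```python
import itertools

class HybridRouter:
    """Illustrative policy: use the KV-aware choice for requests with cache
    overlap, and plain round-robin for cold requests."""

    def __init__(self, workers, kv_router, min_overlap_blocks=1):
        self.workers = workers
        self.kv_router = kv_router            # returns (best_worker, overlap_blocks)
        self.min_overlap = min_overlap_blocks
        self._round_robin = itertools.cycle(workers)

    def pick(self, token_ids):
        worker, overlap = self.kv_router(token_ids, self.workers)
        if overlap >= self.min_overlap:
            return worker                     # reuse existing KV blocks
        return next(self._round_robin)        # spread cold traffic evenly
```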
KV-aware routing is especially useful for large models serving long-context requests, as in code generation use cases. Relevant models here include DeepSeek V3.1, DeepSeek R1 0528, Kimi K2, and Qwen3 Coder.
Qwen3 Coder benchmarks with KV routing
Qwen3 Coder 480B A35B is a popular LLM in the Qwen model family, optimized for tasks like code writing, debugging, and tool use. With a native context window of 262K tokens, it has one of the largest context windows in active use, which users routinely fill to process large code bases.
To show the impact of NVIDIA Dynamo's KV cache-aware routing, we conducted a high-load stress test with long inputs (~50k tokens on average) and outputs (~1k tokens on average), toggling KV routing on and off (in the latter case, routing is random). We saw significant improvements in time to first token (TTFT) and time per output token (TPOT), both measured in milliseconds: on average, KV routing cut TTFT by 50% and TPOT by 34%.
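For reference, this is roughly how the two metrics are defined. The helper below is an illustrative way to derive TTFT and TPOT from any iterator that yields streamed output tokens; it is not the harness we used for the benchmark.

```python
import time

def measure_streaming_latency(stream_tokens):
    """Return (ttft_ms, tpot_ms) for an iterator that yields streamed output tokens."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_tokens:
        count += 1
        if first is None:
            first = time.perf_counter()       # time to first token ends here
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000 if first is not None else float("nan")
    # TPOT: average gap between output tokens after the first one.
    tpot_ms = (end - first) * 1000 / (count - 1) if count > 1 else float("nan")
    return ttft_ms, tpot_ms
```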

Because some benchmarks can be point-in-time, we ran an additional benchmark by shadowing real production traffic from OpenRouter, with a measurable impact on the client side. We saw similar results: a 48% decrease in P95 latency with KV cache-aware routing and a 49% decrease in P99 latency.

Reductions in latency (and in recalculation of the KV cache for repeat requests) mean GPUs are freed up more quickly to serve new requests, which improves overall throughput. We also saw 61% more requests processed per second (RPS) and a 62% increase in output tokens per second (TPS) overall.
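As a rough intuition for why latency and throughput move together: Little's law says steady-state throughput is approximately concurrency divided by latency, so at a fixed number of in-flight requests per replica, lower latency directly frees capacity for new work. The numbers below are illustrative only, not from the benchmark above.

```python
# Little's law: throughput ≈ concurrency / latency at steady state.
# If each replica holds a fixed number of requests in flight, cutting
# per-request latency frees capacity for new requests sooner.
concurrency = 64                    # in-flight requests per replica (illustrative)
latency_baseline_s = 20.0           # end-to-end latency without KV routing (illustrative)
latency_kv_s = 13.0                 # lower latency with KV cache reuse (illustrative)

rps_baseline = concurrency / latency_baseline_s
rps_kv = concurrency / latency_kv_s
print(f"throughput gain ≈ {rps_kv / rps_baseline - 1:.0%}")  # ≈ +54% at the same concurrency
```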

Looking forward
We are continually deepening our usage of NVIDIA tooling. Next, we're benchmarking the impact of Dynamo features like KV cache offloading and transfer to expand context lengths and to improve resource utilization, concurrency, and throughput.
Join NVIDIA and Baseten as we co-host Dynamo and Dine, a hands-on technical workshop on October 23rd in San Francisco during OpenSource week. Discover how the world's largest AI inference workloads run at lightning speed on NVIDIA Dynamo. Space is limited; register here to reserve your spot.