Frontier performance research in production
Run your models at the lowest latency and highest throughput with the Baseten Inference Runtime.
Experience the powerful, flexible Baseten Inference Runtime
Just run faster
Best-in-class AI products provide instant responses. Models on Baseten are reliably fast with low p99 latencies for consistently delightful end-user experiences.
Control SLAs and economics
Model performance is about navigating tradeoffs. Access the entire frontier between latency and throughput to find the right balance for your speed and cost targets.
Don't compromise quality
We don't believe in black-box optimizations. Every model performance technique used in our inference stack is rigorously tested and fully configurable.
Every model performance technique in one configurable runtime
Automatic runtime builds
We take the best open-source inference frameworks (TensorRT, SGLang, vLLM, TGI, TEI, and more) and layer in our own optimizations for maximum performance. Configure runtimes in minutes (vs. hours) from a single file with full support for every relevant model performance technique.
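For a sense of what "a single file" means in practice, here is a purely illustrative sketch. The keys below are hypothetical and not Baseten's actual configuration schema; they only show the shape of a declarative runtime build, where one file selects the engine, hardware, and the optimizations described on this page.

```python
# Hypothetical runtime build declaration (illustrative only; these keys are
# NOT Baseten's real config schema). The point: one declarative file selects
# the engine and the performance techniques, instead of hours of hand-tuning.
runtime_build = {
    "model": "my-org/my-llm",          # placeholder model reference
    "engine": "tensorrt-llm",          # or sglang / vllm, per the frameworks above
    "accelerator": "H100",
    "optimizations": {
        "quantization": "fp8",
        "speculative_decoding": {"method": "eagle"},
        "kv_cache": {"offload": True, "cache_aware_routing": True},
        "structured_output": True,
    },
}
```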
Reliable speculation engine
We natively support speculative decoding and self-speculative techniques like Medusa and Eagle. With the model orchestration fully abstracted, you can fine-tune speculation parameters or use pre-built configs, with support for dynamic speculator selection and online speculator training.
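For intuition, here is a minimal greedy speculative-decoding loop, a sketch of the general technique rather than the production engine: a cheap draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is accepted so several tokens can land per target step. The `draft_next` and `target_next` callables are hypothetical stand-ins for the two models.

```python
# Minimal greedy speculative-decoding sketch (illustrative, not production code).
# `draft_next` and `target_next` are hypothetical callables that map a token
# sequence to the next token id for the draft and target models respectively.

def speculative_decode(prompt, draft_next, target_next, num_draft=4, max_new=32):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. The draft model cheaply proposes a short continuation.
        ctx = list(tokens)
        draft = []
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. The target model checks each proposal; keep the longest agreeing prefix.
        #    (Real engines score all draft positions in ONE batched forward pass.)
        for i, t in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if expected != t:
                tokens += draft[:i] + [expected]  # target's token replaces the mismatch
                break
        else:
            tokens += draft                       # every drafted token was accepted
    return tokens
```

In production engines the verification happens in a single batched forward pass, and self-speculative methods like Medusa and Eagle replace the separate draft model with extra decoding heads attached to the target model itself.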
Modality-specific optimization
Different modalities (language, audio, speech synthesis, embeddings, image and video generation) require different techniques. From leveraging TensorRT-LLM for any autoregressive token-based transformer model to compiling diffusion models, our runtime adapts to any architecture.
Custom kernels
We use kernel fusion to reduce overhead by combining multiple operations (e.g., matrix multiplication, bias addition, activation functions) into a single GPU kernel, alongside memory hierarchy optimization, asynchronous compute, and programmatic dependent launch (PDL) for better memory and GPU utilization.
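The payoff is memory traffic: unfused, each operation reads and writes its own intermediate tensor; fused, the bias and activation are applied while the accumulator is still in registers. A rough Python illustration of the difference, where the inner loops stand in for what a single GPU kernel would do:

```python
# Illustrative sketch of why kernel fusion helps (not actual GPU kernels).
# Unfused: three separate passes over the output -> extra reads/writes of
# intermediate tensors. Fused: one pass computes matmul + bias + activation
# per output element, the way a single fused kernel would.
import numpy as np

def unfused(x, w, b):
    y = x @ w                # pass 1: matmul, writes an intermediate
    y = y + b                # pass 2: bias add, reads and rewrites it
    return np.maximum(y, 0)  # pass 3: ReLU, reads and rewrites again

def fused(x, w, b):
    m, k = x.shape
    _, n = w.shape
    out = np.empty((m, n), dtype=x.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):               # accumulate the dot product "in registers"
                acc += x[i, p] * w[p, j]
            acc += b[j]                      # bias applied before the value is stored
            out[i, j] = acc if acc > 0 else 0.0  # ReLU epilogue in the same pass
    return out

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)
b = np.random.randn(3).astype(np.float32)
assert np.allclose(unfused(x, w, b), fused(x, w, b), atol=1e-4)
```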
Structured output
Hoping for JSON isn't enough. Our runtime guarantees spec adherence for structured output by biasing logits according to a state machine generated before decoding, without increasing inter-token latency. The same system enables tool use for models that support function calls.
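Conceptually, the state machine precomputes which tokens are legal at each point in the grammar, and every decode step just masks the illegal logits. A toy sketch of the idea with a hypothetical five-token vocabulary and grammar, not the production compiler:

```python
# Toy sketch of constrained decoding via logit masking (illustrative only).
# A state machine built before decoding says which tokens are legal in each
# state; at every step we mask all other logits, so the output always matches
# the spec and the masking adds only a cheap vector operation per token.
import math

VOCAB = ['{', '"name"', ':', '"Ada"', '}']   # toy vocabulary
# state -> {allowed token id: next state}; accepts exactly {"name":"Ada"}
FSM = {
    0: {0: 1},   # expect '{'
    1: {1: 2},   # expect '"name"'
    2: {2: 3},   # expect ':'
    3: {3: 4},   # expect a string value
    4: {4: 5},   # expect '}'
}

def mask_logits(logits, state):
    allowed = FSM[state]
    return [l if i in allowed else -math.inf for i, l in enumerate(logits)]

def decode(logits_per_step):
    state, out = 0, []
    for logits in logits_per_step:
        masked = mask_logits(logits, state)
        tok = max(range(len(masked)), key=masked.__getitem__)  # greedy pick
        out.append(VOCAB[tok])
        state = FSM[state][tok]
        if state == 5:
            break
    return ''.join(out)

# Even with logits that "prefer" the wrong token, the output follows the grammar.
steps = [[0.1, 2.0, 0.3, 0.4, 0.5]] * 5
print(decode(steps))   # {"name":"Ada"}
```

Production grammar engines compile JSON Schemas or function signatures into state machines over the real tokenizer vocabulary, which is how conformance is enforced without slowing decoding.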
Optional quantization
Post-training quantization, especially in floating-point formats, can massively improve performance while preserving quality with minimal perplexity gain. We support many approaches, including KV cache quantization, while giving you full control over quantization decisions.
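As a concrete, simplified example of post-training quantization: symmetric per-tensor int8 weight quantization. Production setups more often use per-channel scales or floating-point formats like FP8 with calibration data, but the scale/round/clip mechanics are the same idea.

```python
# Sketch of symmetric per-tensor int8 post-training quantization (illustrative;
# production systems typically use per-channel scales or FP8 with calibration).
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Quantization error is small relative to the weights themselves.
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

KV cache quantization applies the same mechanics to the attention cache, trading a small amount of precision for memory capacity and bandwidth.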
KV Cache optimization
With model contexts getting longer, reusing KV cache is essential for maintaining low time-to-first-token in responses. With KV cache offloading and cache-aware routing, our inference runtime stores and shares caches across GPU, CPU, and system memory while maximizing hit rate.
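The core mechanism can be sketched as a block-level prefix cache: KV blocks are keyed by the token prefix they cover, so a request that shares a prefix with earlier traffic only prefills the tail. The names, block size, and tiers below are hypothetical, purely to illustrate the lookup:

```python
# Toy prefix-cache sketch (illustrative; names, block size, and tiers are hypothetical).
# KV blocks are keyed by a hash of the token prefix, so a new request that shares
# a prefix with earlier traffic can skip recomputing that part of the prefill.
from hashlib import blake2b

BLOCK = 16                     # tokens per KV block
cache = {}                     # prefix hash -> "gpu" | "cpu" tier holding the block

def prefix_key(tokens):
    return blake2b(bytes(str(tokens), "utf-8"), digest_size=8).hexdigest()

def store(tokens, tier="gpu"):
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        cache[prefix_key(tokens[:end])] = tier

def longest_cached_prefix(tokens):
    hit = 0
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        if prefix_key(tokens[:end]) in cache:
            hit = end          # this many prompt tokens need no prefill compute
        else:
            break
    return hit

store(list(range(64)))                          # an earlier request cached 64 tokens
print(longest_cached_prefix(list(range(80))))   # -> 64: only 16 tokens left to prefill
```

Cache-aware routing extends the same idea across replicas: requests are steered to the replica most likely to already hold their prefix, which is what keeps the hit rate high.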
Request prioritization
Prefill (processing the prompt to produce the first token) is often both more computationally expensive and more urgent than decode (generating subsequent tokens), so our runtime can prioritize prefill steps over decode steps to keep time-to-first-token low. We also support disaggregated serving, which runs prefill and decode on separate hardware.
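A stripped-down version of the scheduling idea: put prefill and decode work on one priority queue where prefill outranks decode, with FIFO ordering within each class. This is a sketch of the concept, not the actual scheduler, which also packs work against a token budget for continuous batching.

```python
# Minimal scheduler sketch that favors prefill over decode (illustrative only).
import heapq
import itertools

PREFILL, DECODE = 0, 1         # lower value = higher priority
counter = itertools.count()    # FIFO tie-break within each priority class
queue = []

def submit(kind, request_id):
    heapq.heappush(queue, (kind, next(counter), request_id))

def next_step():
    return heapq.heappop(queue) if queue else None

# Decode steps for ongoing requests are queued first...
submit(DECODE, "req-1")
submit(DECODE, "req-2")
# ...but a newly arrived prompt's prefill jumps ahead, keeping TTFT low.
submit(PREFILL, "req-3")
print(next_step())   # req-3's prefill runs first
```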
Topology-aware parallelism
When serving large models on multiple GPUs and across nodes, model parallelism strategies like tensor parallelism (TP) and expert parallelism (EP) must be mapped to the hardware topology to minimize communication overhead. Our runtime blends TP and EP with other parallelism techniques to serve large models efficiently.
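For intuition on the tensor-parallel piece: split a weight matrix column-wise across ranks, let each rank multiply against its shard, and gather the partial outputs. The NumPy sketch below mimics the math on one machine; in a real deployment the concatenation is an all-gather whose placement is chosen to match NVLink and network topology.

```python
# Sketch of column-wise tensor parallelism (illustrative; real systems overlap
# the communication with compute and pick shardings to match the interconnect).
import numpy as np

def shard_columns(w, world_size):
    return np.split(w, world_size, axis=1)    # each "rank" keeps a slice of the weights

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(8, 12).astype(np.float32)

shards = shard_columns(w, world_size=4)
partials = [x @ w_i for w_i in shards]        # each rank multiplies against its shard
y = np.concatenate(partials, axis=1)          # stand-in for the all-gather across ranks

assert np.allclose(y, x @ w, atol=1e-5)       # sharded result matches the full matmul
```

Expert parallelism follows the same pattern, except whole experts rather than weight slices are assigned to ranks, and tokens are routed to the rank holding their expert.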
Learn more
Talk to our engineers

"With Baseten, we gained a lot of control over our entire inference pipeline and worked with Baseten's team to optimize each step."
Sahaj Garg, Co-Founder and CTO