How SpeechifyAI built an ultra-low-latency voice agent without compromising on voice quality

Company overview

SpeechifyAI is the developer platform arm of Speechify, the world's leading voice AI productivity platform. It brings voice intelligence to developers and enterprises through APIs and platforms for text-to-speech and voice agents. Speechify was founded by Cliff Weitzman to help people with dyslexia keep up with reading, and has since grown to serve more than 50 million users across iOS, Android, Mac, Windows, and Chrome, with over 1,000 voices in 60 languages.

Challenge

When Speechify decided to build an enterprise voice agent product, it started with a significant advantage: a proprietary text-to-speech (TTS) model, SIMBA, that outperformed everything on the open market while remaining highly cost-effective. This was a critical business edge, as TTS is historically the most expensive component of any voice agent pipeline.

But a great TTS model alone doesn't make a voice agent. A production-ready pipeline also requires automatic speech recognition (ASR) to transcribe callers in real time, and a large language model (LLM) to interpret intent and trigger the right actions. All three models need to run in sequence, in the same region, and fast enough that no one on the other end of the line feels the conversation stall.

In a live phone call, even 300–400 ms of added latency is enough to break the experience. Speechify therefore needed a single infrastructure provider capable of hosting the entire three-model chain at the reliability and latency standards a real conversation demands.

"The voice model is the most crucial single point of failure — you cannot failover to a third-party provider because the voice drift is directly noticeable. That's not like a slightly slower ASR or a different LLM response."
Kai Krause, VP of Engineering and AI, Speechify

Solution

Baseten hosts the full three-model chain: streaming Whisper for ASR, GLM-5.2 as the LLM, and SIMBA for TTS, all running in a single region to eliminate inter-service latency. Baseten's voice agent team put significant optimization work into streaming Whisper specifically, tuning it for the real-time demands of a live call rather than treating it as a standard transcription workload.

For the LLM layer, Speechify evaluated several open-source models before landing on GLM-5.2. The selection criteria were voice-agent-specific rather than benchmark-driven: consistent low latency, reliable tool calling, and clean instruction-following on the kinds of imperfect, real-world ASR inputs a live call actually produces. GLM-5.2 fit that profile and runs on Baseten Model APIs alongside the rest of the stack.

With the pipeline in place, Speechify built redundancy into the ASR and LLM layers so either can fail over mid-call without the caller noticing. SIMBA is a different story: voice drift mid-conversation is immediately audible, so it doesn't fail over to a third-party provider. That makes the reliability of the Baseten deployment the direct constraint on call quality.
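The asymmetry between the layers can be sketched as a per-stage failover policy. This is a minimal illustration, not Speechify's implementation: the `Stage` class, the `Provider` type, `StageError`, and `run_turn` are all hypothetical names, and each provider is modeled as a plain function on strings, whereas a real voice pipeline streams audio and tokens.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Provider = Callable[[str], str]  # hypothetical: one call in, one result out


class StageError(RuntimeError):
    """Raised when every provider configured for a stage has failed."""


@dataclass
class Stage:
    name: str
    primary: Provider
    fallbacks: Sequence[Provider] = ()  # empty means "no failover allowed"

    def run(self, payload: str) -> str:
        # Try the primary first, then each fallback in order.
        for provider in (self.primary, *self.fallbacks):
            try:
                return provider(payload)
            except Exception:
                continue  # this provider failed; try the next one, if any
        raise StageError(f"{self.name}: no provider succeeded")


def run_turn(audio: str, asr: Stage, llm: Stage, tts: Stage) -> str:
    """One conversational turn: ASR, then the LLM, then TTS."""
    return tts.run(llm.run(asr.run(audio)))
```

In this sketch the ASR and LLM stages would be constructed with one or more fallbacks, while the TTS stage is constructed with none, so a TTS outage surfaces as a `StageError` instead of silently switching to a different voice.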

Results

Speechify's session orchestrator runs in US-East. On Baseten, all three models — streaming Whisper for ASR, GLM-5.2 for the LLM, and SIMBA for TTS — run in that same region, in the same facility. Every conversational turn round-trips from the orchestrator to ASR, then the LLM, then TTS; co-located, each is a ~1–2 ms same-region hop.

Stitch the same pipeline out of separate cloud APIs, and each model lives wherever that vendor runs it — often a different region, frequently not region-pinned at all — so every per-turn hop carries an inter-region round-trip of 25–70 ms on top of model compute, plus the jitter of public, multi-vendor routing. Co-location strips ~75–150 ms of pure network latency out of every turn and removes the turn-to-turn variance that makes a stitched pipeline feel unpredictable. At a threshold where 300–400 ms breaks the experience, stripping out network latency is one of the highest-leverage changes in the stack.
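The arithmetic behind these figures can be checked in a few lines. The sketch below assumes three network hops per turn (orchestrator to ASR, LLM, and TTS, as described above) and treats the 300–400 ms figure as a latency budget; the function names are illustrative, and the 75–150 ms savings range is taken from the text as stated, not derived.

```python
HOPS_PER_TURN = 3  # orchestrator -> ASR, LLM, TTS (from the text)


def per_turn_range(per_hop_ms, hops=HOPS_PER_TURN):
    """Total network time per turn if every hop falls in per_hop_ms."""
    low, high = per_hop_ms
    return (low * hops, high * hops)


def share_of_budget(saved_ms, budget_ms):
    """Smallest and largest fraction of the budget the savings can represent."""
    return (saved_ms[0] / budget_ms[1], saved_ms[1] / budget_ms[0])


# Co-located: ~1-2 ms per hop, three hops per turn.
colocated = per_turn_range((1, 2))
print(colocated)  # (3, 6)

# Stated savings of 75-150 ms against a 300-400 ms threshold.
share = share_of_budget((75, 150), (300, 400))
print(share)  # (0.1875, 0.5)
```

Under these assumptions, co-located network time is a few milliseconds per turn, while the stated savings amount to roughly a fifth to a half of the latency at which a call starts to feel broken.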

Speechify's voice agent pipeline runs entirely on Baseten, with SIMBA handling every call as the non-substitutable voice layer and GLM-5.2 providing the LLM reasoning. The full chain beats closed API provider benchmarks on end-to-end latency, while giving Speechify the flexibility to swap or fine-tune any model in the stack as the product matures. That infrastructure efficiency flows straight into pricing: SIMBA text-to-speech starts at $6 per 1M characters, and voice agents are priced all-in by the minute from $0.068/min, with no separate LLM, speech-to-text, or telephony passthrough.

"When you build a voice agent out of separate cloud APIs, every turn pays a network tax: you're reaching out to ASR, an LLM, and TTS, and if any of them isn't in your region, that's tens of milliseconds of round-trip on top of compute, on every turn. Co-locating all three on Baseten in one region collapses that to almost nothing."
Kai Krause, VP of Engineering and AI, Speechify
