Company overview
Speechify trains and operates its own SIMBA family of TTS models (including the new SIMBA 3.0 vLLM), alongside text normalization, voice conversion, sound FX, and page parsing models. In a typical month, Speechify’s TTS platform synthesizes more than 161 billion characters and serves nearly 1,000 requests per second through text normalization alone.
You can find more information about Speechify’s models and start using them at Speechify.ai.
Real-time text-to-speech at scale
Challenge
Before Speechify joined forces with Baseten, its workloads ran on a self-managed stack, peaking at around 1,500 GPUs across 18 zones in 7 regions. Speechify has a global user base, but this footprint was less a deliberate design choice than a reflection of GPU scarcity at its scale and the difficulty of securing capacity.
With over 60M users to serve, managing the inference infrastructure behind Speechify’s TTS platform added friction for its engineering teams. The infrastructure complexity grew so much that even when researchers had a new model ready, shipping it to customers took days of work from the entire platform team.
The reason we came to Baseten in the first place was the complexity of managing our own infrastructure. Our priority is continuing to deliver the best TTS platform for our 60M+ users, and we didn’t want our inference infrastructure to stand in the way of that.
As the Speechify team prepared to launch its new SIMBA 3.0 vLLM model, real-time latency was essential. Beyond the overhead of managing infrastructure, the team was also looking to lower its inference latency and cold start times (around 9–11 minutes at the time) to lock in the best possible user experience.
Solution
Speechify worked with Baseten to drastically reduce operational complexity while increasing model performance.
The Speechify team now uses rolling deploys with fully automated CI/CD for zero-downtime model promotions at scale. Rolling deploys give Speechify pre-promotion replica readiness gates and autoscaling continuity during rollouts.
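For readers unfamiliar with the pattern, here is a minimal sketch of what a readiness-gated, zero-downtime cutover looks like. This is not Baseten’s internal implementation; the replica and traffic-shifting primitives are hypothetical stand-ins.

```python
"""Minimal sketch of a readiness-gated, zero-downtime rollout.
Not Baseten's actual implementation; is_ready and shift_traffic
are hypothetical stand-ins for platform primitives."""
import time

def rolling_promote(new_replicas, is_ready, shift_traffic, timeout_s=600, poll_s=5):
    # Gate promotion on every new replica reporting ready. Until the gate
    # passes, the existing deployment keeps serving (and autoscaling), so a
    # failed rollout never takes traffic.
    deadline = time.monotonic() + timeout_s
    pending = set(new_replicas)
    while pending:
        if time.monotonic() > deadline:
            raise TimeoutError(f"replicas not ready before cutover: {sorted(pending)}")
        pending = {r for r in pending if not is_ready(r)}
        if pending:
            time.sleep(poll_s)
    shift_traffic(new_replicas)  # cut traffic over only after the gate passes

# Toy usage: two replicas that report ready immediately.
if __name__ == "__main__":
    rolling_promote(
        new_replicas=["replica-a", "replica-b"],
        is_ready=lambda replica: True,
        shift_traffic=lambda replicas: print(f"traffic shifted to {replicas}"),
    )
```

The key property is that the old deployment keeps serving, and autoscaling, until every new replica is healthy, so a bad rollout never takes traffic.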
A researcher shipped SIMBA 3.0 vLLM into production by themselves. That would have taken days of work from our entire AI Platform team with our old infrastructure stack.
Baseten now hosts more than 10 production model deployments for Speechify, including the full SIMBA TTS family across English and multilingual builds, as well as text normalization, voice conversion, sound FX, and page parsing. All have traffic-based autoscaling, comprehensive observability, and high performance out of the box with the Baseten Inference Stack.
Researchers ship new models using Baseten’s open-source model packaging tool, Truss (truss push), instead of filing platform engineering tickets. End-to-end product testing on an experimental checkpoint now takes about an hour, versus days previously.
Baseten collapsed a per-model tower of Terraform, Envoy, and Filestore into a single "truss push".
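For context, a Truss is a small Python scaffold plus a config.yaml that declares the model’s name, requirements, and GPU resources. Below is a minimal sketch of the kind of artifact a researcher pushes; the dummy TTS model is hypothetical, not Speechify’s actual SIMBA code.

```python
# model/model.py in a Truss directory (created with `truss init`).
# Hypothetical dummy TTS model for illustration only.
class Model:
    def __init__(self, **kwargs):
        self._engine = None

    def load(self):
        # Runs once per replica at startup; real code would load TTS
        # weights onto the GPU here.
        self._engine = lambda text: f"<audio for {len(text)} chars>"

    def predict(self, model_input: dict) -> dict:
        # Called once per request. A config.yaml next to this file
        # declares the model name, Python requirements, and GPU resources.
        return {"audio": self._engine(model_input["text"])}
```

From the project directory, truss push deploys the model to Baseten, and CI can run the same command for fully automated releases.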
Results
Now, the Speechify team can self-serve model deployments with fully automated CI/CD, eliminating days of manual work per release. Its new SIMBA 3.0 vLLM models run on Baseten with 76 ms p50 TTFB. Thanks to Baseten’s efficient autoscaling, model performance work, and infrastructure optimizations, Speechify’s cost per million characters dropped by 44%, even as traffic on its platform grew 7% over the same period.
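As a quick sanity check of what those two numbers imply together, assume total spend scales linearly with characters synthesized (indexed, illustrative figures, not Speechify’s actual spend):

```python
# Back-of-envelope check: what a 44% drop in unit cost plus 7% traffic
# growth implies for total spend. Indexed, illustrative figures only.
old_unit_cost = 1.00                        # indexed cost per 1M characters, pre-migration
new_unit_cost = old_unit_cost * (1 - 0.44)  # 44% cheaper per 1M characters
traffic_growth = 1.07                       # 7% more characters synthesized

relative_spend = (new_unit_cost * traffic_growth) / old_unit_cost
print(f"Total spend vs. pre-migration: {relative_spend:.0%}")  # ≈ 60%
```

In other words, total spend fell roughly 40% even while the platform served more traffic.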
In terms of management complexity, Speechify was able to sunset a massive estate of GPUs and the orchestration layers around them, including:

- 7 GKE clusters
- ~940 self-managed GPUs across 18 zones
- ~9 hand-tuned Envoy proxies
- 7 regional Filestore instances
- a cross-repo CD orchestration that triggered platform engineering actions for every release
- per-service GCP service accounts, IAM bindings, image replication jobs, AOT cache population jobs, monitoring dashboards, and alert policies
- 9 inference Terraform modules

Each of these layers was its own category of incidents, alerts, runbooks, and on-call rotations that no longer exists for the Speechify Platform Engineering team.
Across the SIMBA TTS family, p99 inference latency dropped 30–50% post-migration, even as peak traffic on actively growing models rose 60–100% in the same window. Replica startup became 4.5x faster, latency spikes dropped by 50–70%, and the engineering team is now free to double down on research and product advancements. The partnership continues to grow.
Working with the Baseten team was a no-brainer. They decreased our model latency by over 50%, reduced our cost per million characters by over 44%, and delivered the highest uptime of any inference provider we know of. Baseten has enabled Speechify to provide the highest-quality, lowest-latency, and most cost-efficient text-to-speech AI voice models in the world to consumers, developers, and enterprises.