
Since 2024, we’ve been proud to offer the fastest, most accurate, and most cost-efficient Whisper transcription on the market. Now, we’re excited to announce the latest iteration of our work: real-time, speaker-aware transcription that’s even faster and lower in cost.
We focus on engineering flexible solutions for production applications, not just optimizing specific models. As a result, our Whisper transcription pipeline can be customized for each use case, with or without streaming or diarization, and with a configurable number of GPUs (rather than being coupled to entire nodes, as many inference providers require).
We’re proud to power real-time transcription and diarization for products like Notion’s AI Meeting Notes.
The foundation: The fastest Whisper transcription gets even faster
Our engineers are constantly pushing the envelope on inference performance across every model modality. Since we first announced the fastest, most accurate, and most cost-efficient Whisper transcription, we’ve continued to iterate on our pipeline for the customers who use it daily in production. As of this writing, our Whisper transcription pipeline is over 2x faster than before, reaching a 2400x real-time factor (RTF) with Whisper Large V3 Turbo running on 8 H100 MIGs. At that rate, an hour of audio is transcribed in about 1.5 seconds.
Transcription speed in seconds (left axis) and real-time factor (audio duration / processing time, right axis) for different numbers of H100 MIGs (x-axis). On a full node, we see 2400x RTF for transcription with Whisper Large V3 Turbo and 1800x RTF for Whisper Large V3. Note: this does not include network latency or audio download time, which can be highly variable and will impact overall latency.

Speed is crucial, but the pipeline also has to be designed for real-world use cases. We’ve consistently heard from our customers and prospects that other inference providers bundle solutions with rigid hardware commitments. That’s why our team focused on engineering a flexible solution that can be coupled with a configurable number of GPUs, giving cost-sensitive use cases more control.
We built our Whisper transcription pipeline (including the streaming and diarization variants) on top of Baseten Chains. Chains is an SDK for compound AI systems that lets us decouple the parts of the pipeline and run each on dedicated hardware with custom autoscaling (for example, running the WebSocket server on a CPU while the Whisper model server runs on H100 MIGs). This delivers exceptional cost-efficiency through engineering alone: in head-to-head bake-offs with our competitors, nobody has come close.
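To make that structure concrete, here’s a minimal sketch in the spirit of the truss_chains SDK. The chainlet names, compute settings, and transcription logic are illustrative assumptions rather than our production pipeline; the pattern to note is that each chainlet declares its own compute, so the CPU-bound entrypoint and the GPU-bound model server deploy and autoscale independently.

```python
# Minimal two-chainlet sketch (illustrative; not the production pipeline).
import truss_chains as chains


class WhisperTranscriber(chains.ChainletBase):
    # GPU-bound chainlet: a real chain would load Whisper here and run inference.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(gpu="H100", gpu_count=1)  # hypothetical settings
    )

    def run_remote(self, audio_chunk: bytes) -> str:
        # Placeholder for actual Whisper inference on the GPU.
        return f"<transcript of {len(audio_chunk)} bytes>"


@chains.mark_entrypoint
class TranscriptionEntrypoint(chains.ChainletBase):
    # CPU-bound chainlet: handles I/O and fans work out to the GPU chainlet.
    remote_config = chains.RemoteConfig(compute=chains.Compute(cpu_count=1))

    def __init__(self, transcriber=chains.depends(WhisperTranscriber)) -> None:
        self._transcriber = transcriber

    def run_remote(self, audio_chunks: list[bytes]) -> list[str]:
        return [self._transcriber.run_remote(chunk) for chunk in audio_chunks]
```

Because the two chainlets scale separately, a spike in connection handling doesn’t force you to pay for idle GPUs, and vice versa.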
Built on Baseten Chains for maximum cost-efficiency, our transcription pipeline is 78-98% lower cost than competitors’. Baseten cost per audio minute is estimated by saturating one H100 MIG and using list pricing. Pricing can vary slightly based on workload volume, autoscaling settings, and any volume discounts off list price.

In addition to these performance and cost improvements to our standard Whisper transcription offering, we’ve added two features: streaming audio transcription for real-time use cases, and speaker annotation (a.k.a. diarization), which is also available with streaming.
You can deploy our optimized Whisper model server from our model library, or reach out to talk to our engineers about leveraging it for higher-scale workloads.
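As a quick sketch of what calling a deployed model looks like, assuming Baseten’s standard predict endpoint, with a hypothetical model ID and an assumed input schema (the exact payload depends on how the model server is configured):

```python
import requests

MODEL_ID = "abcd1234"  # hypothetical model ID
API_KEY = "YOUR_API_KEY"

# Assumed payload shape: a URL pointing at the audio file to transcribe.
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"url": "https://example.com/sample-audio.mp3"},
)
resp.raise_for_status()
print(resp.json())
```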
The fastest Whisper transcription with streaming
For use cases that require streaming, we’ve expanded our Whisper pipeline to support real-time transcription. Our live transcription is ideal for use cases like live note-taking applications, content captioning, customer support, and any real-time, voice-powered apps.
Features include (see the client sketch after this list):
Configurable update cadence for the partial transcriptions delivered
Consistent real-time latency under high volume of concurrent audio streams
Automatic language detection for multilingual transcription
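To illustrate the streaming flow, here’s a minimal client sketch using Python’s websockets library. The endpoint URL, message framing, and response fields are assumptions for illustration only; the actual protocol details come with your deployment.

```python
# Minimal streaming-client sketch; endpoint URL and message schema are assumed.
import asyncio
import json

import websockets  # pip install websockets

WS_URL = "wss://example.baseten.co/v1/transcribe"  # hypothetical endpoint


async def stream_audio(chunks):
    async with websockets.connect(WS_URL) as ws:

        async def send():
            for chunk in chunks:
                await ws.send(chunk)  # raw audio bytes as they arrive
            await ws.close()

        async def receive():
            # Partial transcriptions arrive at the configured update cadence.
            async for raw in ws:
                message = json.loads(raw)
                print(message.get("text", ""))  # assumed response field

        await asyncio.gather(send(), receive())


async def main():
    # Stand-in for a microphone or audio-file chunker.
    fake_chunks = [b"\x00" * 3200 for _ in range(10)]
    await stream_audio(fake_chunks)


asyncio.run(main())
```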
Built on top of our ultra-low-cost Whisper transcription pipeline, our real-time Whisper transcription offering also consistently outprices other solutions without sacrificing quality.
Baseten pricing measured using one H100 MIG supporting 20+ concurrent streams of real-time diarization for Whisper Large V3, and 40+ concurrent streams for Whisper Large V3 Turbo.

The fastest Whisper transcription with diarization
Diarization is an inherently hard problem: it’s often unknown how many speakers a dialogue will contain or when a new speaker will join, and aligning speaker tags with a transcript is challenging. All of this becomes even more complicated when diarization runs in real time (i.e., with streaming).
Our team swarmed to tackle this problem, building an accurate, low-cost, and insanely fast diarization pipeline for our customers’ production use cases. We developed a number of custom optimizations on top of state-of-the-art diarization libraries, along with a custom speaker assignment algorithm to accurately map the annotation results to transcripts.
As a result, we achieve the fastest, most accurate production-ready diarization on the market. Our solution is also 50-90% cheaper than competitors' while offering higher throughput with the same number of GPUs.
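Our exact speaker assignment algorithm is proprietary, but the alignment problem it solves is easy to state: given diarization segments labeled by speaker and transcript words with timestamps, assign each word to the speaker whose segment overlaps it most. Here’s a minimal sketch of that overlap rule (an illustration of the general idea, not our production algorithm):

```python
# Assign each transcribed word a speaker by maximum temporal overlap.
# Illustrative only; a production algorithm handles many more edge cases.
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float
    speaker: str


@dataclass
class Word:
    start: float
    end: float
    text: str


def assign_speakers(words: list[Word], segments: list[Segment]) -> list[tuple[str, str]]:
    labeled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for seg in segments:
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = seg.speaker, overlap
        labeled.append((best_speaker, word.text))
    return labeled


words = [Word(0.0, 0.4, "hello"), Word(0.5, 0.9, "there"), Word(1.1, 1.5, "hi")]
segments = [Segment(0.0, 1.0, "speaker_0"), Segment(1.0, 2.0, "speaker_1")]
print(assign_speakers(words, segments))
# [('speaker_0', 'hello'), ('speaker_0', 'there'), ('speaker_1', 'hi')]
```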
Diarization is an optional add-on to any of our Whisper transcription pipelines (with or without streaming). Currently, our diarization + transcription pipeline powers Notion’s AI Meeting Notes.
Real-time diarization
Traditional diarization pipelines require full audio context. Ours does not. We use a sliding-window approach that produces high-quality results in real time while preserving speaker consistency across extended streaming sessions.
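One common way to preserve speaker consistency across windows (an assumption about the general technique, not a description of our exact method) is to keep a running embedding centroid per global speaker and match each window’s local speakers to the nearest centroid:

```python
# Sketch: map per-window local speaker labels to stable global identities
# via cosine similarity against running speaker-embedding centroids.
# The threshold and update rate are illustrative assumptions.
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # below this, register a new speaker


class SpeakerRegistry:
    def __init__(self) -> None:
        self.centroids: list[np.ndarray] = []  # one per global speaker

    def match(self, embedding: np.ndarray) -> int:
        """Return a stable global speaker index for one window-local embedding."""
        embedding = embedding / np.linalg.norm(embedding)
        best_idx, best_sim = -1, SIMILARITY_THRESHOLD
        for idx, centroid in enumerate(self.centroids):
            sim = float(embedding @ (centroid / np.linalg.norm(centroid)))
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0:
            # Known speaker: nudge the centroid toward the new observation.
            self.centroids[best_idx] = 0.9 * self.centroids[best_idx] + 0.1 * embedding
            return best_idx
        # No sufficiently similar centroid: this is a new speaker.
        self.centroids.append(embedding)
        return len(self.centroids) - 1
```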
Our real-time diarization is ideal for use cases like AI notetaking, live conferencing, customer support, and any speaker-aware conversational AI apps. In contrast to other solutions that use visual cues or annotative data, our approach is audio-only, with no additional inputs. Our diarization feature has been validated under heavy load, with thousands of concurrent audio streams in production, without degrading accuracy or cost-efficiency.
We’re continuously iterating to push the frontier of diarization performance. Reach out to our engineers to discuss how we can customize our speaker-aware Whisper transcription for your workload!