Run Qwen3 Embedding on NVIDIA Blackwell GPUs

Baseten Embeddings Inference on Blackwell GPUs (B200s) delivers over 3x higher throughput compared to TEI and vLLM for production embedding models at scale.

Run Qwen3 Embedding on NVIDIA Blackwell GPUs with Baseten Embeddings Inference (BEI)

We’re excited to announce Baseten Embeddings Inference (BEI) for Blackwell GPUs. This means you can take advantage of the latest open-source embeddings models, such as Qwen3 Embedding, and the latest NVIDIA GPUs, while using the most performant embeddings inference engine: BEI.

To illustrate the unlocked potential, we’ve run benchmarks using the Qwen3 Embedding 8B model. This is the largest model in the Qwen3 Embedding series; it inherits the multilingual capabilities, long-text understanding, and reasoning skills of its foundation model counterpart.

As of this writing, Qwen3 Embedding 8B is #1 on the Massive Text Embedding Benchmark (MTEB) multilingual leaderboard, with a mean task score of 70.58.

BEI provides the fastest embeddings inference on B200s

On a high query-throughput test (500 tokens per request), running Qwen3 Embedding 8B with BEI on B200s processes 1.5x more tokens per second than BEI on an H100 (the next-best solution), 3.3x more than vLLM on an H100, and 3.6x more than TEI on an H100.

On a high query-throughput test (500 tokens per request), BEI on B200s gives 3.3x higher throughput than vLLM and 3.6x higher throughput than TEI running on H100s.

On a low query-throughput test (5 tokens per request), BEI on B200s has 8.4x higher throughput than vLLM and 1.6x higher than TEI.
You can see the full benchmark results in the table below.
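For context on what a test like this measures, here is an illustrative sketch of a client-side throughput check along the same lines as the high query-throughput setting above (roughly 500 tokens per request). This is not our benchmark harness: the base URL, API key, model name, and request counts are all placeholder assumptions for a deployment that exposes an OpenAI-compatible /v1/embeddings endpoint.

```python
# Illustrative sketch only (not our benchmark harness): estimate embedding
# throughput by sending concurrent fixed-size requests to an
# OpenAI-compatible /v1/embeddings endpoint and reporting tokens/second.
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder values -- substitute your own deployment's details.
client = AsyncOpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
)

TOKENS_PER_REQUEST = 500   # mirrors the "high query-throughput" setting above
NUM_REQUESTS = 256
CONCURRENCY = 32

# Crude stand-in for a ~500-token document (treating one word as ~one token).
DOCUMENT = " ".join(["token"] * TOKENS_PER_REQUEST)


async def embed_one(sem: asyncio.Semaphore) -> None:
    async with sem:
        await client.embeddings.create(
            model="qwen3-embedding-8b",  # placeholder model name
            input=DOCUMENT,
        )


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    await asyncio.gather(*(embed_one(sem) for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    total_tokens = NUM_REQUESTS * TOKENS_PER_REQUEST
    print(f"{total_tokens / elapsed:,.0f} tokens/second (approximate)")


if __name__ == "__main__":
    asyncio.run(main())
```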

High-throughput, low-latency embeddings inference in production

To get started with Qwen3 Embedding 8B, you can deploy it via our Model Library. (Note: this link deploys on H100 MIG by default, but you can change the instance type to B200.)
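Once deployed, you can query the model over HTTP. Below is a minimal sketch assuming your deployment exposes an OpenAI-compatible /v1/embeddings route; the API key, base URL, and model name are placeholders for your own deployment's values.

```python
# Minimal sketch: query a deployed Qwen3 Embedding 8B model through an
# OpenAI-compatible /v1/embeddings endpoint. The api_key, base_url, and
# model name below are placeholders -- substitute your deployment's values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",  # placeholder model name
    input=[
        "What is the capital of France?",
        "Paris is the capital and largest city of France.",
    ],
)

for item in response.data:
    print(f"embedding {item.index}: dim={len(item.embedding)}")
```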

If you’re running embedding models or compound AI systems at scale, reach out to learn how we can optimize your workloads. In the meantime, you can also check out our technical deep dive on how we optimized BEI, or our docs for more information!
