Run Qwen3 Embedding on NVIDIA Blackwell GPUs

Baseten Embeddings Inference on Blackwell GPUs (B200s) delivers over 3x higher throughput compared to TEI and vLLM for production embedding models at scale.

Run Qwen3 Embedding on NVIDIA Blackwell GPUs with Baseten Embeddings Inference (BEI)

We’re excited to announce Baseten Embeddings Inference (BEI) for Blackwell GPUs. This means you can take advantage of the latest open-source embeddings models, such as Qwen3 Embedding, and the latest NVIDIA GPUs, while using the most performant embeddings inference engine: BEI.

To illustrate the unlocked potential, we’ve run benchmarks using the Qwen3 Embedding 8B model. This is the largest model in the Qwen3 Embedding series; it inherits the multilingual capabilities, long-text understanding, and reasoning skills of its foundation model counterpart.

As of this writing, Qwen3 Embedding 8B is #1 on the Massive Text Embedding Benchmark (MTEB) multilingual leaderboard, with a mean task score of 70.58.

BEI provides the fastest embeddings inference on B200s

On a high query-throughput test (500 tokens per request), running Qwen3 Embedding 8B with BEI on B200s processes 1.5x more tokens per second than BEI on an H100 (the next-best solution), 3.3x more than vLLM on an H100, and 3.6x more than TEI on an H100.

On a high query-throughput test (500 tokens per request), BEI on B200s gives 3.3x higher throughput than vLLM and 3.6x higher throughput than TEI running on H100s.

On a low query-throughput test (5 tokens per request), BEI on B200s has 8.4x higher throughput than vLLM and 1.6x higher than TEI.
You can see the full benchmark results in the table below.
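For context on what a test like this measures, here is an illustrative sketch of a client-side throughput check along the same lines as the high query-throughput setting above (roughly 500 tokens per request). This is not our benchmark harness: the base URL, API key, model name, and request counts are all placeholder assumptions for a deployment that exposes an OpenAI-compatible /v1/embeddings endpoint.

```python
# Illustrative sketch only (not our benchmark harness): estimate embedding
# throughput by sending concurrent fixed-size requests to an
# OpenAI-compatible /v1/embeddings endpoint and reporting tokens/second.
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder values -- substitute your own deployment's details.
client = AsyncOpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
)

TOKENS_PER_REQUEST = 500   # mirrors the "high query-throughput" setting above
NUM_REQUESTS = 256
CONCURRENCY = 32

# Crude stand-in for a ~500-token document (treating one word as ~one token).
DOCUMENT = " ".join(["token"] * TOKENS_PER_REQUEST)


async def embed_one(sem: asyncio.Semaphore) -> None:
    async with sem:
        await client.embeddings.create(
            model="qwen3-embedding-8b",  # placeholder model name
            input=DOCUMENT,
        )


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    await asyncio.gather(*(embed_one(sem) for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    total_tokens = NUM_REQUESTS * TOKENS_PER_REQUEST
    print(f"{total_tokens / elapsed:,.0f} tokens/second (approximate)")


if __name__ == "__main__":
    asyncio.run(main())
```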

High-throughput, low-latency embeddings inference in production

To get started with Qwen3 Embedding 8B, you can deploy it via our Model Library. (Note: this link deploys on H100 MIG by default, but you can change the instance type to B200.)
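Once deployed, you can query the model over HTTP. Below is a minimal sketch assuming your deployment exposes an OpenAI-compatible /v1/embeddings route; the API key, base URL, and model name are placeholders for your own deployment's values.

```python
# Minimal sketch: query a deployed Qwen3 Embedding 8B model through an
# OpenAI-compatible /v1/embeddings endpoint. The api_key, base_url, and
# model name below are placeholders -- substitute your deployment's values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",  # placeholder model name
    input=[
        "What is the capital of France?",
        "Paris is the capital and largest city of France.",
    ],
)

for item in response.data:
    print(f"embedding {item.index}: dim={len(item.embedding)}")
```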

If you’re running embedding models or compound AI systems at scale, reach out to learn how we can optimize your workloads. In the meantime, you can also check out our technical deep dive on how we optimized BEI, or our docs for more information!
