Superhuman achieves 80% faster embedding model inference with Baseten

Company overview

Superhuman is the AI-native email app for productivity. Superhuman automatically triages incoming email, sets follow-up reminders, unlocks team collaboration, and even makes typing faster with autocompletion, autocorrect, text expansion, and instant replies.

To deliver that experience, Superhuman relies on a suite of dozens of models, from closed-source to open-source, fine-tuned, and custom-built, powering classification, search, and retrieval for a highly personalized product experience.

Challenges

Superhuman saves its users 4+ hours per week by helping them get 2x more done in their inbox. But many AI features need to feel instant to avoid interrupting users’ workflows.

Superhuman replaced off-the-shelf closed-source models with a suite of dozens of custom and fine-tuned embedding models to improve the quality of AI classification in their product. To bring these models to their users in production, they needed to spin up infrastructure with:

  1. Consistent low latencies. Superhuman’s previous inference setup was unable to hit their P95 latency target of <500ms for their largest embedding models.

  2. Scale for worldwide power users. With users around the world sending more than half a billion email messages, Superhuman’s inference system needed to support flexible scale without fixed compute reservations.

  3. Support for heterogeneous models. Superhuman needed a solution that could support the models they use today and would be future-proof against changes to model architectures, parameter counts, and dimensionalities.

  4. A lean engineering team. Superhuman’s AI engineering team specializes in training and composing AI models, not building and operating distributed GPU infrastructure and optimizing model inference runtime performance.

We have ambitious goals for a best-in-class customer experience powered by consistent low-latency inference, but I didn’t think it would be a good use of our team’s bandwidth to build an inference platform in-house.
Loïc Houssier, CTO

Solutions

To achieve Superhuman’s targets on a rapid timeline, Baseten paired our Forward-Deployed Engineers (FDEs) and model performance engineers directly with Superhuman’s MLEs and engineering leadership to deploy, optimize, benchmark, and scale Superhuman’s models on the Baseten Inference Stack.

We were able to complete this project in just one week and achieve Superhuman’s goals with:

  1. Baseten Embeddings Inference (BEI). We deployed Superhuman’s embedding models on BEI — our TensorRT‑LLM–based runtime designed specifically for embeddings, re-rankers, and classifiers — to maximize throughput and minimize latency.

  2. Autoscaling access to the latest GPUs. Superhuman gained elastic scale across regions and cloud providers with latency-aware routing and active-active reliability thanks to Baseten’s Multi-cloud Capacity Management (MCM) infrastructure.

  3. Baseten Performance Client. Client code can hide subtle bottlenecks that inflate end-to-end latency. The Baseten Performance Client delivers high-throughput inference for embedding models without adding client-side latency or bottlenecks (see the sketch after this list).

  4. An intuitive developer experience. Superhuman’s MLEs and software engineers immediately felt comfortable configuring and operating infrastructure via Baseten’s Python-based developer tooling.
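
Client-side request handling matters here: if a caller issues embedding requests one at a time or batches them poorly, the client itself becomes the hidden bottleneck. The sketch below illustrates the general pattern of concurrent, batched requests against an OpenAI-compatible embeddings endpoint. It is not the Performance Client’s API; the URL, model name, header format, and batch size are illustrative placeholders, not Superhuman’s or Baseten’s actual values.

```python
import asyncio
import os

import httpx

# Placeholder values for illustration; not Superhuman's or Baseten's actual
# endpoint, model, or credentials.
EMBEDDINGS_URL = "https://example.api.baseten.co/v1/embeddings"
API_KEY = os.environ["BASETEN_API_KEY"]
MODEL = "custom-embedding-model"
BATCH_SIZE = 32


async def embed_batch(client: httpx.AsyncClient, texts: list[str]) -> list[list[float]]:
    # One OpenAI-compatible embeddings request for a single batch of texts.
    response = await client.post(
        EMBEDDINGS_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"model": MODEL, "input": texts},
    )
    response.raise_for_status()
    return [item["embedding"] for item in response.json()["data"]]


async def embed_all(texts: list[str]) -> list[list[float]]:
    # Fan batches out concurrently so the client never serializes requests
    # behind one another; the server handles batching and scale-out.
    batches = [texts[i : i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
    async with httpx.AsyncClient(timeout=30.0) as client:
        results = await asyncio.gather(*(embed_batch(client, b) for b in batches))
    return [vector for batch in results for vector in batch]


if __name__ == "__main__":
    snippets = [f"email snippet {i}" for i in range(100)]
    vectors = asyncio.run(embed_all(snippets))
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```

Keeping many batches in flight at once is the kind of client-side detail that otherwise shows up as unexplained tail latency in production.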

The deployment mechanism is so good that I was able to self-serve 95% of what I needed, and the Baseten team was incredibly responsive every time I had a question.
Agustín Bernardo, Senior AI Engineer
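
For context on that developer experience: Baseten’s Python-based tooling (the open-source Truss framework) has engineers describe a model as a small Python class with load and predict methods, alongside a config file, and push it from the command line. The sketch below shows the general shape of that interface with a hypothetical off-the-shelf embedding model; it is not Superhuman’s deployment code, and BEI-backed embedding deployments may follow a more specialized path.

```python
# model/model.py: the general shape of a Truss model class.
# The embedding model below is a hypothetical example, not one of Superhuman's
# fine-tuned models.
from sentence_transformers import SentenceTransformer


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Called once per replica at startup; load weights here.
        self._model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def predict(self, model_input: dict) -> dict:
        # model_input is the parsed JSON body of the request.
        embeddings = self._model.encode(model_input["texts"])
        return {"embeddings": embeddings.tolist()}
```

From there, deploying is typically a single truss push command from the project directory, which matches the self-serve workflow described above.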

Results

After benchmarking the performance of the Baseten Inference Stack, Superhuman’s AI engineering team quickly deployed their suite of custom models and switched over their production traffic. Our forward-deployed engineering team provided hands-on support during the migration to ensure that there would be zero user impact.

Baseten cut our P95 latency by 80% across the dozens of fine-tuned embedding models that power core features in Superhuman's AI-native email app.
Loïc Houssier, CTO

We beat Superhuman’s original latency targets while delivering:

  1. 80% lower latencies. Across models and architectures, BEI delivered an average 80% reduction in all-important P95 latency.

  2. Broad model coverage. Superhuman’s MLEs have deployed dozens of custom embedding models on Baseten via a unified developer interface.

  3. Engineering time savings. Adopting Baseten rather than building in-house de-risked Superhuman’s engineering roadmap and freed up multiple engineers for essential product work.

Superhuman is all about saving time. With Baseten, we're delivering a faster product for our customers while reducing engineering time spent on infrastructure.
Loïc Houssier, CTO

What’s Next

In July 2025, Grammarly announced it would acquire Superhuman to accelerate building the AI-native productivity suite of choice. Grammarly reaches over 40 million users daily, providing a massive opportunity to scale Superhuman’s reach and impact. The team at Baseten looks forward to continuing to provide low-latency, elastic inference for Superhuman through their next phase of growth.