Company overview
Hebbia is the leader in institutional intelligence. The world's most sophisticated asset managers, investment banks, professional services firms, and Fortune 500 companies rely on Hebbia to turn massive amounts of data into fast, reliable insights. In finance, speed is a competitive advantage. Professionals driving deals and investment decisions need answers they can trust, consistently, in a secure and collaborative environment built for how they actually work.
Impact highlights
2.5x improvement in tokens per second (TPS) and 4x improvement in time to first token (TTFT) using the Baseten Inference Stack
Reduced cost by over 10x by shifting to Baseten
Real-time data insights through chat
Challenge
Finance professionals expect fast and accurate responses that they can rely on as they navigate competitive and tight deal processes. Hebbia needed an inference provider that could serve latency-sensitive chat traffic quickly, with high uptime, to ensure they could create a user experience befitting finance professionals working in large institutions.
As Hebbia’s user base scaled, they faced limited flexibility to optimize for the latency and reliability their customers demanded.
"The reason why we came to Baseten in the first place was the latency requirements. With our bursty workloads, we got queued for our requests similar to any other user of AI. And our customers don't care about who's queuing you."
Solution
Hebbia partnered with Baseten for a dedicated deployment of an open-source LLM. To maximize the efficiency of Hebbia’s workload, Baseten uses KV cache-aware routing, which sends each request to the instance that already holds the relevant cached context. This greatly reduces latency on repeat requests, where users feel delays the most. Hebbia’s deployment also uses speculative decoding, which speeds up generation by having a lightweight draft model propose several tokens at once that the target model then verifies in a single forward pass, rather than generating one token at a time.
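The intuition behind KV cache-aware routing is that follow-up requests in the same conversation should land on the replica that already holds the cached attention keys and values for the shared prefix. A minimal sketch in Python, assuming a hypothetical `route` helper and replica names for illustration (this is not Baseten's actual implementation, which routes on cached-prefix state rather than a simple hash):

```python
import hashlib

# Illustrative replica pool; names are hypothetical.
REPLICAS = ["replica-0", "replica-1", "replica-2"]

def route(conversation_id: str, replicas: list[str] = REPLICAS) -> str:
    """Pick a replica deterministically from the conversation ID, so every
    request in the same chat hits the replica whose KV cache already
    contains the conversation's shared prefix."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

# Repeat requests for the same conversation always reach the same replica,
# so the cached context is reused instead of recomputed:
assert route("chat-42") == route("chat-42")
```

A production router would also account for cache contents, replica load, and autoscaling events, but the core idea is the same: keep requests with shared context together so the expensive prefill work is done once.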
With Baseten’s elastic infrastructure, Hebbia pays only for the GPUs it uses rather than reserving fixed capacity, which greatly reduces cost. Hebbia can also shift to new open-source models (or custom LLMs) on the same infrastructure, giving the team the flexibility to adopt the latest and greatest models whenever they become available. This ensures their customers always have access to the best-quality outputs and stay on the bleeding edge.
"Our customers really care about speed, reliability, and the ability to deploy custom fine-tuned models for financial services. Baseten met all three criteria so far and it’s been a boon to our business."
Result
Baseten delivered consistent, low-latency inference for Hebbia's high-priority chat traffic, meeting the real-time experience their users expect in production. With Baseten, Hebbia saw a 2.5x increase in tokens per second and a 4x improvement in time to first token compared to their previous deployment. By shifting from a closed-source model provider to Baseten, Hebbia reduced inference costs by over 10x without compromising on the reliability or latency their enterprise customers depend on.