Company overview
Latent Health provides the largest health systems in the US with state-of-the-art clinical question answering, empowering them to truly understand their patients. Leveraging in-house LLMs and embedding models, Latent uses AI to answer arbitrarily complex clinical questions, accelerating time-to-treatment for patients and reducing the administrative burden on providers.
Challenges
Latent’s multi-modal search and question answering operates over a mixture of structured and unstructured content, including notes, labs, medications, and media.
Effective multi-modal question answering requires using an ensemble of models and processing steps, including optical character recognition (OCR) to process documents, fine-tuned LLMs for question-answering, and custom embedding and reranker models to power retrieval.
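Conceptually, each question flows through several of these models in sequence. The sketch below illustrates that kind of compound pipeline; the function names and placeholder implementations are hypothetical stand-ins for Latent's actual OCR, embedding, reranker, and LLM models.

```python
# Hypothetical stand-ins for the real models in a compound QA pipeline.
def run_ocr(document: bytes) -> str:
    return document.decode(errors="ignore")  # placeholder for an OCR model

def embed(text: str) -> list[float]:
    return [float(len(text))]  # placeholder for a custom embedding model

def rerank(question: str, passages: list[str]) -> list[str]:
    # Placeholder reranker: order passages by a crude similarity score.
    q = embed(question)[0]
    return sorted(passages, key=lambda p: abs(embed(p)[0] - q))

def generate_answer(question: str, context: list[str]) -> str:
    return f"Answer to {question!r} from {len(context)} passages"  # placeholder LLM

def answer_question(question: str, scanned_docs: list[bytes]) -> str:
    passages = [run_ocr(d) for d in scanned_docs]  # OCR: CPU-bound document processing
    top = rerank(question, passages)[:5]           # retrieval: embeddings + reranker
    return generate_answer(question, top)          # generation: fine-tuned LLM
```

Each stage has different hardware and scaling needs, which is exactly what makes serving the whole pipeline on general-purpose infrastructure painful.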
To serve their compound workflows in production, the Latent team deployed their models directly via a large cloud service provider (CSP). But as their workloads began to scale, they experienced pain on multiple fronts:
Complexity of infrastructure management: The Latent engineering team was spending an increasing amount of time managing infrastructure, when they wanted to focus on developing their core product.
Difficulty managing many models: The developer experience of their CSP made it challenging to deploy, orchestrate, and manage the many different models and processing steps Latent works with.
Reliability: The complexity of spinning up additional resources made it difficult to maintain high availability as demand increased and left Latent vulnerable to single points of failure.
Performance: Latent had aggressive latency targets for question-answering. Meeting these targets required deep inference-specific expertise for pipeline orchestration.
Time to experiment: Latent wanted to get insights quickly from experiments with different model architectures, but their CSP made it slow to get signal from each iteration and complicated to deploy and test new models.
Without the ability to autoscale parts of their workflow independently, Latent found it difficult to optimize both latency and cost, a challenge that grew as their volume increased to millions of documents daily. The Latent team engaged Baseten to abstract away the complexity of their inference infrastructure and simplify deployment and scaling while optimizing performance and reliability.
Solutions
Baseten Chains for model orchestration and efficiency
Our forward deployed engineers worked closely with Latent’s engineering team to optimize their compound AI workflow. To remove processing bottlenecks, Latent uses Baseten Chains, our SDK for building and deploying ultra-low-latency compound AI systems.
Chains abstracts away the complexity of deploying compound AI systems. Different models and processing steps are deployed as atomic components called Chainlets, which call each other directly, can be deployed on different hardware, and scale independently. Latent's AI models run on powerful GPUs, while the OCR processing steps run on CPUs for cost-efficiency, scaling independently to quickly make incoming patient data searchable.
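As a minimal sketch (not Latent's actual code), a Chain with a CPU-based OCR Chainlet feeding a GPU-based question-answering Chainlet might look like this with the truss_chains SDK; the model logic is stubbed out, and the exact Compute fields may vary by SDK version:

```python
import truss_chains as chains

class OCR(chains.ChainletBase):
    # OCR runs on CPUs and autoscales independently of the GPU Chainlet.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=4, memory="8Gi")
    )

    def run_remote(self, document: str) -> str:
        return document  # placeholder for a real OCR model

@chains.mark_entrypoint
class AnswerQuestion(chains.ChainletBase):
    # The question-answering model runs on a GPU.
    remote_config = chains.RemoteConfig(compute=chains.Compute(gpu="H100"))

    def __init__(self, ocr: OCR = chains.depends(OCR)) -> None:
        self._ocr = ocr

    def run_remote(self, question: str, document: str) -> str:
        text = self._ocr.run_remote(document)  # remote call into the CPU Chainlet
        return f"Answer to {question!r} over {len(text)} chars"  # placeholder LLM call
```

Because each Chainlet declares its own compute, the OCR fleet can scale out on inexpensive CPU instances during ingest spikes without touching the GPU pool.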
Model performance and reliability
By removing bottlenecks via decoupled autoscaling, Chains has a huge impact on latency, GPU utilization, and cost-efficiency. Baseten's engineers also worked with Latent's team to make additional performance improvements at the runtime level using the Baseten Inference Stack, especially via inference framework optimizations (such as choosing TensorRT-LLM over vLLM).
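One way to verify that such framework-level changes move the needle is to measure end-to-end latency percentiles directly. The following is a generic measurement sketch, not Baseten tooling; the endpoint URL and payload are hypothetical:

```python
import os
import statistics
import time

import requests

ENDPOINT = "https://chain-<chain_id>.api.baseten.co/production/run_remote"  # hypothetical
API_KEY = os.environ["BASETEN_API_KEY"]

latencies = []
for _ in range(100):
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"question": "What was the last recorded A1C?", "document": "..."},
        timeout=30,
    )
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

# P90: the latency 90% of requests beat; quantiles(n=10) yields 9 cut points.
p90 = statistics.quantiles(latencies, n=10)[8]
print(f"P90 end-to-end latency: {p90 * 1000:.0f} ms")
```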
The Baseten platform also powers leading reliability with seamless autoscaling and optimized cold starts to accommodate traffic spikes, and cross-cluster autoscaling to reduce vulnerability to hardware failures.
Developer experience
The Baseten platform puts developer experience on an equal footing with model performance and reliability. Baseten makes it easy for organizations to manage and iterate on many different models in production, with a user-friendly UI, comprehensive observability, detailed logging, custom metrics and health checks, tracing, and more.
Ease of transition
Since the Baseten platform is HIPAA compliant, OpenAI-compatible, and cloud-agnostic, Latent could quickly and painlessly switch over from their previous solution.
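OpenAI compatibility means the migration can be as small as repointing an existing client. Here is a hedged sketch using the standard openai Python SDK; the base URL and model name are placeholders, not Latent's actual endpoint:

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at a Baseten-hosted model.
# The base_url below is a placeholder; use your deployment's real endpoint.
client = OpenAI(
    base_url="https://model-<model_id>.api.baseten.co/v1",  # placeholder endpoint
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="latent-clinical-qa",  # hypothetical model name
    messages=[{"role": "user", "content": "Summarize the latest lab results."}],
)
print(response.choices[0].message.content)
```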
Results
Using Chains for their compound AI systems, the Baseten platform for model management, and the Baseten Inference Stack for performance and reliability, Latent has achieved:
99.999% uptime across its years of partnership with Baseten
600 millisecond P90 end-to-end latency for question-answering
6x improved GPU utilization by removing processing bottlenecks
“Baseten has saved us countless hours of experimentation and eliminated the stress of worrying about inference reliability.”
What’s next
Baseten is constantly improving its developer experience to make it even easier to manage, observe, and iterate on individual models and compound AI systems in production. Baseten continues to partner closely with Latent’s team to ensure they have the performance, uptime, and ease-of-use they need to power the leading pharmacy platform in the US.
Check out Latent’s website to see how it optimizes pharmacy workflows and accelerates time-to-therapy for patients.