Scaled Cognition offers ultra-fast AI agents you can trust

40% lower end-to-end latency

<120 ms TTFT

100s of experiments

Company overview

Scaled Cognition powers enterprise-grade agentic workflows with its proprietary Agentic Pretrained Transformer (APT) model family. Recently, the Scaled Cognition team’s first agentic system, APT-1, topped agentic leaderboards, outperforming existing frontier models from OpenAI, Anthropic, and Google. They have even more breakthroughs already in motion.

Scaled Cognition’s Agent Builder platform enables developers to create, test, and deploy agents in under an hour, with support for both text and voice in multiple languages. 

Inference challenges

Winning on model performance

Quality has always been fundamental to the Scaled Cognition team, but for customers to adopt its agentic models and systems in production, serving those models as fast as possible became essential. As enterprise customers evaluated multiple infrastructure options, Scaled Cognition prioritized delivering world-class latency and reliability to match its deterministic model performance.

Scaled Cognition has always been known for the quality of its agents, but performance is just as important. We partner with Baseten to ensure the lowest possible latency for our models and agentic workflows. It’s been a major differentiator for us in the market.
Jordan DeLoach, VP of Engineering

Quick implementation was key to maintaining momentum with existing prospects, so Baseten’s engineers collaborated with Scaled Cognition to benchmark various solutions that leveraged their existing workloads and rapidly optimized them for active POCs. 

Dynamic scaling to meet customer demand

Scaled Cognition sought more adaptive infrastructure to support rapid growth and dynamic scaling, ensuring consistent performance and efficiency across fluctuating workloads. This requirement became more pressing as its first big model launch approached.

Scaled Cognition also wanted to leverage its existing AWS investments as part of a hybrid deployment strategy. The team was looking for a partner that could provide dynamic scaling for their launch, without wasting the resources they had already invested in.

Solutions

Scaled Cognition collaborated with Baseten’s inference engineers to integrate the Baseten Inference Stack, enabling rapid optimization for ultra-low latency and scalable deployment of its agentic workloads. Baseten’s forward deployed engineers (FDEs) worked within Scaled Cognition’s strict timeline to support its launch and ongoing prospect demos.

To meet the Scaled Cognition team’s strict timelines, Baseten’s engineers provided:

  • Custom Server configurations to deploy proprietary model environments and unlock horizontal scale without compromising control or compliance.

  • A hybrid hosting solution that leverages existing AWS investments, with additional spillover capacity available on Baseten Cloud.

  • Geographically strategic compute allocation for further latency reduction.

  • Early access to new GPU types to help Scaled Cognition stay at the bleeding edge of model performance.

For Scaled Cognition, a key performance benchmark was time to first token (TTFT), as it directly influences user-perceived responsiveness in agentic workflows. With intelligent request routing and massive horizontal scale, Baseten optimized Scaled Cognition’s TTFT beyond its competition. 
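To make the metric concrete, here is a minimal sketch of how TTFT can be measured on the client side for any streamed response. It is an illustration only, not Baseten's or Scaled Cognition's benchmarking code: `measure_ttft`, `simulated_stream`, and the delay values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response.

    The timer starts right before the stream is consumed, so for a lazy
    stream the TTFT includes everything that happens before the first token.
    ttft_seconds is None if the stream yields no tokens.
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # First token observed: this is the time to first token.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return ttft, total, count


def simulated_stream(first_delay_s: float, gap_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay_s)  # queueing, routing, and prefill before token 1
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap_s)  # per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    # Assumed delays, for illustration only.
    ttft, total, count = measure_ttft(simulated_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, tokens: {count}")
```

In a real benchmark, the simulated stream would be replaced by the iterator returned by a streaming client, and the measurement would be repeated across many requests to report percentiles rather than a single value.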

By using a hybrid deployment model, Scaled Cognition could flexibly scale models across its AWS environments – essential for ensuring sufficient capacity for its agent launch. Baseten Cloud is SOC 2 Type II and HIPAA compliant, making it suitable for sensitive agentic workloads.
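The idea of preferring existing capacity and spilling over only when it is full can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how Baseten's routing works: the `Pool` class, the `route` and `release` functions, the pool names, and the capacity figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    """A hypothetical pool of model replicas with a fixed concurrency limit."""
    name: str
    capacity: int       # max concurrent requests (assumed figure)
    in_flight: int = 0  # requests currently being served

    def has_room(self) -> bool:
        return self.in_flight < self.capacity


def route(primary: Pool, spillover: Pool) -> Optional[str]:
    """Pick a pool for one request, preferring the primary pool.

    Returns the chosen pool's name, or None if both pools are full.
    """
    for pool in (primary, spillover):
        if pool.has_room():
            pool.in_flight += 1
            return pool.name
    return None


def release(pool: Pool) -> None:
    """Mark one request on the pool as finished."""
    pool.in_flight = max(0, pool.in_flight - 1)
```

For example, with `Pool("aws", capacity=2)` as the primary and `Pool("baseten-cloud", capacity=1)` as the spillover, the first two requests land on the primary pool, the third spills over, and a fourth is rejected until a request is released. A production system would also weigh latency, region, and autoscaling signals rather than a simple concurrency count.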

Results

The Baseten team was able to quickly implement a tailored solution in time for Scaled Cognition’s launch and active POCs, leading to: 

  • Dynamic autoscaling for their agent launch with on-demand flex capacity.

  • 40% lower overall latency.

  • <120 ms time to first token.

Baseten’s dynamic autoscaling also lets Scaled Cognition’s researchers run hundreds of experiments at once, cost-effectively, as they continue to iterate on APT-1. Custom Servers let Scaled Cognition plug Baseten’s inference platform directly into its research workflows while still using its own optimizations to eliminate hallucinations and improve agent quality.

We really appreciate the collaboration with Baseten. The Hybrid Cloud solution and access to cutting-edge GPUs while working with our existing AWS commitments were key to our success on launch day and beyond. Beyond that, Baseten’s developer experience has been a favorite across our team.
Jordan DeLoach, VP of Engineering

What’s next

Scaled Cognition recently announced its partnership with Genesys to power the Genesys Cloud™ platform, which reaches more than 8,000 customers. In addition to the Genesys partnership, Scaled Cognition is now powering some of the largest BPOs and global brands directly.

As this growth continues, Baseten remains a close partner to Scaled Cognition’s engineers, providing the high performance and reliable compute access they need as the Agent Builder platform and agentic models gain momentum.