Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
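
Concretely, speculative decoding hides latency by letting a small draft model propose several tokens that the large target model verifies in a single forward pass. A minimal greedy sketch of that draft-and-verify loop is below; the `draft_model` and `target_model` callables are hypothetical stand-ins for the real engines, not Baseten's implementation:

```python
import torch

def speculative_step(target_model, draft_model, prompt_ids, k=4):
    """One greedy speculative decoding step: draft k tokens cheaply,
    verify them in a single target-model pass, and keep the longest
    prefix on which the two models agree."""
    # The small draft model proposes k tokens autoregressively.
    draft_ids = prompt_ids
    for _ in range(k):
        logits = draft_model(draft_ids)                    # [seq_len, vocab]
        draft_ids = torch.cat([draft_ids, logits[-1].argmax().view(1)])

    # The large target model scores prompt + draft in ONE pass, so its
    # cost is amortized over up to k+1 tokens instead of one.
    target_preds = target_model(draft_ids).argmax(dim=-1)  # [seq_len]

    n = len(prompt_ids)
    accepted = prompt_ids
    for i in range(k):
        if target_preds[n + i - 1] == draft_ids[n + i]:
            # Target agrees with the draft token: keep it for free.
            accepted = torch.cat([accepted, draft_ids[n + i].view(1)])
        else:
            # First disagreement: take the target's token and stop.
            accepted = torch.cat([accepted, target_preds[n + i - 1].view(1)])
            break
    else:
        # Every draft token accepted: the verify pass yields a bonus token.
        accepted = torch.cat([accepted, target_preds[n + k - 1].view(1)])
    return accepted
```

When the draft model agrees with the target often, each expensive target pass emits several tokens instead of one, which is where the latency savings come from.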

Model performance

How to build function calling and JSON mode for open-source and fine-tuned LLMs

Use a state machine to generate token masks for logit biasing, enabling function calling and structured output at the model server level.
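
As a toy illustration of that mechanism, the sketch below walks a hand-written state machine for a flat JSON object and turns each state's set of legal next tokens into a logit mask; the six-token vocabulary and transition table are invented for the example, where a real server would mask over the model's full tokenizer vocabulary:

```python
import torch

# Toy vocabulary; a real server masks over the model's full tokenizer.
VOCAB = ['{', '}', '"key"', ':', '"value"', ',']

# Flat-JSON-object state machine: legal next tokens per state, and the
# state each token transitions to.
TRANSITIONS = {
    "start":       {'{': "need_key"},
    "need_key":    {'"key"': "need_colon", '}': "done"},
    "need_colon":  {':': "need_value"},
    "need_value":  {'"value"': "after_value"},
    "after_value": {',': "need_key", '}': "done"},
}

def token_mask(state):
    """0 for tokens legal in this state, -inf for everything else."""
    mask = torch.full((len(VOCAB),), float("-inf"))
    for tok in TRANSITIONS[state]:
        mask[VOCAB.index(tok)] = 0.0
    return mask

# Decode loop: bias each step's logits with the mask, then advance the machine.
state, out = "start", []
while state != "done":
    logits = torch.randn(len(VOCAB))      # stand-in for real model logits
    token = VOCAB[int((logits + token_mask(state)).argmax())]
    out.append(token)
    state = TRANSITIONS[state][token]
print("".join(out))  # always a well-formed flat JSON object
```

Because illegal tokens are masked to negative infinity before sampling, the model cannot emit malformed output no matter what its raw logits prefer.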

News

Introducing function calling and structured output for open-source and fine-tuned LLMs

Automatically add function calling and structured output capabilities to any open-source or fine-tuned LLM supported by TensorRT-LLM.
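
From the client side, that might look like declaring a function schema alongside the request; the endpoint URL and the OpenAI-style `tools` field below are assumptions for illustration, not a documented request shape:

```python
import requests

resp = requests.post(
    "https://model-xxxxxx.api.baseten.co/production/predict",  # hypothetical endpoint
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        # OpenAI-style tool schema, assumed here for illustration.
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    },
)
print(resp.json())  # server-side token masking guarantees a well-formed tool call
```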
