Technical Writer
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
To compare tokens per second fairly across different large language models, we need to adjust for tokenizer efficiency.
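A minimal sketch of that adjustment, with made-up numbers and a placeholder `tokenizer_encode` callable rather than any specific tokenizer: because tokenizers split the same text into different numbers of tokens, raw tokens per second is rescaled to a reference tokenizer's granularity before comparing models.

```python
# Illustrative sketch only: names and figures are hypothetical, not benchmarks.

def tokens_per_char(tokenizer_encode, sample_text: str) -> float:
    """Average number of tokens a tokenizer emits per character of input."""
    return len(tokenizer_encode(sample_text)) / len(sample_text)

def normalized_throughput(tokens_per_second: float,
                          model_tokens_per_char: float,
                          reference_tokens_per_char: float) -> float:
    """Rescale raw tokens/second to a reference tokenizer's granularity.

    A model whose tokenizer emits more tokens per character looks faster in
    raw tokens/second for the same amount of generated text, so that
    advantage is divided back out.
    """
    return tokens_per_second * (reference_tokens_per_char / model_tokens_per_char)

# Made-up example: model B reports 110 tok/s but uses 0.36 tok/char, while the
# reference model uses 0.30 tok/char; normalized, model B delivers ~91.7 tok/s.
print(normalized_throughput(110.0, 0.36, 0.30))
```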
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and 200 ms time to first chunk.
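As a rough sketch of the client side, the snippet below consumes a hypothetical streaming endpoint over HTTP chunked transfer and measures time to first chunk; the URL and request payload are placeholders, not the actual XTTS V2 deployment.

```python
# Hedged sketch: a streaming client that times the first received audio chunk.
import time
import requests

ENDPOINT = "https://example.com/xtts-v2/stream"  # placeholder URL

def stream_tts(text: str):
    start = time.time()
    first_chunk_seen = False
    with requests.post(ENDPOINT, json={"text": text}, stream=True) as resp:
        resp.raise_for_status()
        # Yield audio bytes as they arrive instead of waiting for the full file.
        for chunk in resp.iter_content(chunk_size=None):
            if not first_chunk_seen:
                first_chunk_seen = True
                print(f"time to first chunk: {(time.time() - start) * 1000:.0f} ms")
            yield chunk

if __name__ == "__main__":
    audio = b"".join(stream_tts("Hello from a streaming text-to-speech endpoint."))
    print(f"received {len(audio)} bytes of audio")
```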
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
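A minimal sketch of request-level dynamic batching, assuming a placeholder `run_model` that processes a list of inputs in one call: requests are collected until a batch-size or wait-time limit is hit, then run together. Continuous batching, as implemented in inference servers, additionally admits new requests at every decoding step; this sketch only shows the request-level case.

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 10

def run_model(batch):
    # Placeholder for a real batched forward pass.
    return [f"output for {item}" for item in batch]

async def batcher(queue: asyncio.Queue):
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        prompts, futures = zip(*batch)
        for fut, out in zip(futures, run_model(list(prompts))):
            fut.set_result(out)

async def infer(queue: asyncio.Queue, prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(infer(queue, f"prompt {i}") for i in range(20)))
    print(results[:3])
    task.cancel()

asyncio.run(main())
```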
NVIDIA Multi-Instance GPU (MIG) enables splitting a single H100 GPU into two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
Quantizing Mistral 7B to FP8 resulted in a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
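For reference, the sketch below shows one common way to compute the perplexity metric used in such comparisons, with Hugging Face Transformers and a placeholder evaluation text; it does not perform FP8 inference itself, which requires an engine such as TensorRT-LLM, and the model ID and sample are assumptions for illustration.

```python
# Hedged sketch: perplexity of a causal LM on a small text sample.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "The quick brown fox jumps over the lazy dog."
print(perplexity("mistralai/Mistral-7B-v0.1", sample))
```

Running the same measurement over a held-out corpus for the original and quantized deployments is what supports a "near-zero increase in perplexity" claim.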