Engineering, Product
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
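At a high level, speculative decoding has a small draft model propose several tokens that the larger target model then verifies in a single pass, which is where the latency savings come from. A minimal sketch of that draft-and-verify loop, using hypothetical greedy stand-in models rather than the integration's actual API:

```python
# Minimal sketch of greedy speculative decoding. `draft` and `target` are
# hypothetical placeholders, not the integration's real API: each maps a
# token sequence to its predicted next token.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token predictor

def speculative_decode(
    target: Model,
    draft: Model,
    prompt: List[Token],
    max_new_tokens: int,
    lookahead: int = 4,
) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes `lookahead` tokens serially.
        proposed: List[Token] = []
        for _ in range(lookahead):
            proposed.append(draft(tokens + proposed))

        # 2. The target model checks the proposals (a real engine batches
        #    this into one forward pass); keep the longest agreeing prefix.
        accepted = 0
        for i in range(lookahead):
            if target(tokens + proposed[:i]) == proposed[i]:
                accepted += 1
            else:
                break
        tokens.extend(proposed[:accepted])

        # 3. On the first disagreement, keep the target model's own token,
        #    so every loop iteration makes progress.
        if accepted < lookahead:
            tokens.append(target(tokens))
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Toy demo: both "models" count upward, so every draft is accepted.
    next_int: Model = lambda ts: ts[-1] + 1
    print(speculative_decode(next_int, next_int, [0], max_new_tokens=8))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

When the draft model agrees with the target most of the time, each verification pass commits several tokens at once instead of one, which is how end-to-end latency can roughly halve.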
Model performance
How to build function calling and JSON mode for open-source and fine-tuned LLMs
Use a state machine to generate token masks for logit biasing, enabling function calling and structured output at the model server level.
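As a toy illustration of the idea, the sketch below hand-builds a tiny state machine for the fixed output {"key": "value"} and biases every disallowed logit to negative infinity at each step. The vocabulary, states, and transitions are invented for illustration; a real server would compile them from the tokenizer and a JSON schema or function signature.

```python
# Toy sketch of state-machine-driven logit masking for structured output.
import math

VOCAB = ['{', '}', '"key"', ':', '"value"', 'hello']  # invented toy vocabulary

# State machine for the fixed shape {"key": "value"}: each state maps the
# allowed next token ids to the state they lead to.
TRANSITIONS = {
    "start": {0: "open"},    # '{'
    "open":  {2: "key"},     # '"key"'
    "key":   {3: "colon"},   # ':'
    "colon": {4: "value"},   # '"value"'
    "value": {1: "done"},    # '}'
    "done":  {},
}

def mask_logits(logits: list[float], state: str) -> list[float]:
    """Bias disallowed tokens to -inf so decoding can only follow the machine."""
    allowed = TRANSITIONS[state].keys()
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# One decoding step: mask the logits, pick greedily, advance the state.
state, out = "start", []
while TRANSITIONS[state]:
    raw_logits = [0.5, 0.1, 0.9, 0.3, 0.2, 2.0]  # stand-in model output
    masked = mask_logits(raw_logits, state)
    tok = max(range(len(VOCAB)), key=lambda i: masked[i])
    out.append(VOCAB[tok])
    state = TRANSITIONS[state][tok]

print("".join(out))  # -> {"key":"value"}
```

Because the mask is applied before sampling, the model physically cannot emit a token that breaks the grammar, no matter what its raw logits prefer; that is what makes the guarantee hold at the model server level rather than relying on post-hoc validation.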
News
Introducing function calling and structured output for open-source and fine-tuned LLMs
Automatically add function calling and structured output capabilities to any open-source or fine-tuned large language model supported by TensorRT-LLM.