Stream model output

Baseten now supports streaming model output. Instead of waiting for the entire output to be generated, you can start returning results to users immediately, with a sub-one-second time to first token.

Streaming Llama-2 output

Streaming is supported for all model outputs, but it’s particularly useful for large language model (LLM) output for three reasons:

  1. Partial outputs are actually useful for LLMs. They allow you to quickly get a good sense of what kind of response you’re getting. You can tell a lot from the first few words of output.

  2. LLMs are often used in chat and other latency-sensitive applications. Users expect a snappy UX when interacting with an LLM.

  3. Waiting for complete responses takes a long time, and the longer the LLM’s output, the longer the wait.

With streaming, the output starts almost immediately, every time.
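
To make the difference between time to first token and total response time concrete, here is a minimal sketch of a client that consumes a stream chunk by chunk. It does not use Baseten's API: `fake_model_stream` is a hypothetical stand-in for a streaming endpoint, and `consume_stream` is an illustrative helper that accepts any iterable of byte chunks.

```python
import codecs
import time
from typing import Iterable, Iterator, Tuple


def fake_model_stream(text: str, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming model endpoint.

    Yields one word at a time, sleeping between words to simulate
    per-token generation time.
    """
    for word in text.split(" "):
        time.sleep(delay_s)
        yield (word + " ").encode("utf-8")


def consume_stream(chunks: Iterable[bytes]) -> Tuple[str, float, float]:
    """Print chunks as they arrive.

    Returns (full_text, seconds_to_first_chunk, total_seconds).
    """
    # An incremental decoder handles multi-byte characters that happen
    # to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        text = decoder.decode(chunk)
        print(text, end="", flush=True)  # show partial output right away
        parts.append(text)
    parts.append(decoder.decode(b"", final=True))
    total = time.perf_counter() - start
    if first_chunk_at is None:  # empty stream
        first_chunk_at = total
    return "".join(parts), first_chunk_at, total


if __name__ == "__main__":
    prompt_output = "Streaming lets users start reading before generation ends"
    _, first, total = consume_stream(fake_model_stream(prompt_output))
    print(f"\nfirst chunk after {first:.3f}s, full output after {total:.3f}s")
```

With a real HTTP client, the `chunks` argument could be whatever byte iterator the client exposes for a streamed response body; the timing logic stays the same. The gap between the two reported times grows with output length, which is the wait that streaming hides from the user.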

For more on streaming with Baseten, check out an example LLM with streaming output and a sample project built with Llama 2 and Chainlit.