Llama 2 7B Chat TRT-LLM

A seven-billion-parameter foundation model tuned for chat use cases, served with a low-latency TensorRT-LLM model server.

Deploy Llama 2 7B Chat TRT-LLM behind an API endpoint in seconds.

Example usage

Streaming Token Example

This Python example streams the output tokens as they are generated. The model has three main inputs:

  1. prompt: The input text sent to the model.

  2. stream: Set this to True to stream the tokens as they are generated.

  3. max_tokens: Controls the maximum length of the output sequence.

Because this example streams the tokens as they are generated, the response arrives as raw text chunks rather than as a single JSON object.
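The three inputs listed above can be assembled with a small helper before the request is sent. This is a minimal sketch: `build_payload` is a hypothetical name, and the validation rules (non-empty prompt, positive `max_tokens`) are assumptions, not documented server-side constraints.

```python
from typing import Any, Dict


def build_payload(prompt: str, stream: bool = False, max_tokens: int = 512) -> Dict[str, Any]:
    """Assemble the request body from the three documented inputs.

    The checks below are illustrative assumptions, not rules enforced by the API.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return {"prompt": prompt, "stream": stream, "max_tokens": max_tokens}


# Build the same body used in the streaming example.
payload = build_payload("What is a llamas favorite food?", stream=True)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`.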

Input
import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

data = {
    "prompt": "What is a llamas favorite food?",
    "stream": True,
    "max_tokens": 512
}

# Call model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
    stream=True
)

# Print the generated tokens as they get streamed
for content in res.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
JSON output
[
    "Llamas",
    "are",
    "herbivores",
    "..."
]
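When a streamed response is decoded chunk by chunk, a multi-byte UTF-8 character can be split across two chunks, and calling `.decode("utf-8")` on each piece separately would then raise an error. One way to guard against this is an incremental decoder from the standard library. The sketch below uses a hypothetical helper, `decode_chunks`, and a simulated byte stream in place of a live response.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded text from byte chunks, buffering incomplete characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # Empty when the chunk ended mid-character
            yield text
    # Flush whatever is still buffered at the end of the stream
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Simulated stream: "é" (bytes C3 A9) is split across two chunks.
pieces = list(decode_chunks([b"caf", b"\xc3", b"\xa9", b" ok"]))
print("".join(pieces))
```

With a live response, the same helper could presumably be fed from `res.iter_content(chunk_size=None)`, which yields chunks in whatever size they arrive.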

Non-Streaming Example

If you don't want to stream the tokens, set the stream parameter to False.

The output of the model is a JSON object whose text key contains the entire generated text.

Input
import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

data = {
    "prompt": "What is a llamas favorite food?",
    "stream": False,
    "max_tokens": 512
}

# Call model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data
)

# Print the output of the model
print(res.json())
JSON output
{
    "text": "What is a llamas favorite food?\nLlamas are herbivores, which means they eat plants for food. Their favorite foods are grasses, hay, and other types of vegetation. They also enjoy fruits and vegetables, such as apples and carrots. Some llamas may also enjoy treats, such as sugar cubes or peanut butter."
}
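In the sample output above, the text value begins with the original prompt. If only the completion is wanted, the echoed prompt can be removed on the client side. This is a minimal sketch that assumes the prompt is echoed verbatim at the start of the text, as in the sample. `strip_prompt` is a hypothetical helper, and the response dictionary is a shortened stand-in for `res.json()`.

```python
def strip_prompt(text: str, prompt: str) -> str:
    """Return the generated text without a leading copy of the prompt."""
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    # Leave the text untouched when the prompt is not echoed
    return text


# Shortened stand-in for the parsed JSON response
response = {"text": "What is a llamas favorite food?\nLlamas are herbivores."}
answer = strip_prompt(response["text"], "What is a llamas favorite food?")
print(answer)
```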

Deploy any model in just a few commands

Avoid getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.

$ truss init --example stable-diffusion-2-1-base ./my-sd-truss
$ cd ./my-sd-truss
$ export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
$ truss push
INFO Serializing Stable Diffusion 2.1 truss.
INFO Making contact with Baseten 👋 👽
INFO 🚀 Uploading model to Baseten 🚀
Upload progress: 0% | | 0.00G/2.39G