Mistral 7B Chat TRT-LLM

A state-of-the-art seven-billion-parameter LLM for general chat tasks, with an OpenAI Chat Completions-compatible endpoint and a low-latency TRT-LLM model server.

Deploy Mistral 7B Chat TRT-LLM behind an API endpoint in seconds.

Example usage

OpenAI Chat Completions Streaming Tokens Example

This code example shows how to invoke the model with the OpenAI Chat Completions API. The model has three main inputs:

messages: A list of JSON objects. Each object has a role key, set to either user or assistant, and a content key holding the text passed to the large language model.
stream: Setting this to True streams the tokens as they are generated.
max_tokens: Controls the maximum length of the output sequence.
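
The shape of the messages input can be sketched with a small helper. This is only an illustration, not part of the Baseten or OpenAI API: the function name build_messages and the strict user/assistant alternation check are assumptions made here for clarity.

```python
def build_messages(turns):
    """Turn a list of plain strings into chat messages.

    Hypothetical helper: assumes the conversation starts with the user
    and that roles alternate user, assistant, user, ...
    """
    if not turns:
        raise ValueError("at least one user turn is required")
    if len(turns) % 2 == 0:
        raise ValueError("the last turn must come from the user")
    roles = ("user", "assistant")
    return [
        {"role": roles[i % 2], "content": text}
        for i, text in enumerate(turns)
    ]


# Illustrative conversation; the assistant turn is shortened for brevity
messages = build_messages([
    "What is a mistral?",
    "A mistral is a cold, dry wind.",
    "How does the mistral wind form?",
])
```

The resulting list can be passed as the messages argument in the examples that follow.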

Because this code example streams the tokens as they get generated, it does not produce a JSON output.

Input

from openai import OpenAI
import os

# Replace the empty string with your model id below
model_id = ""

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1"
)

# Call model endpoint
res = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    temperature=0.9,
    max_tokens=512,
    stream=True
)

# Print the generated tokens as they get streamed
for chunk in res:
    # Some chunks (for example the last one) may carry no text, so skip them
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

JSON output

[
    "Mistral",
    "is",
    "a",
    "type",
    "..."
]
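
If you want the full reply as a single string in addition to printing tokens, the streamed pieces can be joined as they arrive. The sketch below is a minimal illustration: collect_stream and fake_chunk are hypothetical names introduced here, and the stand-in chunks only mimic the chunk.choices[0].delta.content shape used in the example above so the code can run without a deployed model.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string.

    Assumes each chunk exposes chunk.choices[0].delta.content and that
    content may be None on some chunks.
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk for local testing (not a real API object)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Mistral"), fake_chunk(" is"), fake_chunk(" a"), fake_chunk(None)]
full_text = collect_stream(demo)
```

With a real response, you would pass the res object returned by client.chat.completions.create(..., stream=True) instead of demo.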

OpenAI Chat Completions Non-Streaming Example

The code example below shows how to use the same OpenAI Chat Completions API without token streaming. To do this, simply remove stream from the API call.

The output will be the entire generated text produced by the model.

Input

from openai import OpenAI
import os

# Replace the empty string with your model id below
model_id = ""

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1"
)

# Call model endpoint
res = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    temperature=0.9,
    max_tokens=512
)

# Print the output of the model
print(res.choices[0].message.content)

JSON output

{
    "output": "<s> [INST] What is a mistral? [/INST]A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour.</s>  [INST] How does the mistral wind form? [/INST]The mistral wind forms as a result of the orographic lifting of cold air from the Alps. The cold air rises and cools, forming clouds and precipitation."
}

REST API Token Streaming Example

Using the OpenAI Chat Completions API is optional. You can also make a REST API call using the requests library. To invoke the model using this method, you need three similar inputs: messages, stream, and max_new_tokens.

Because this code example streams the tokens as they get generated, it does not produce a JSON output.

Input

import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

messages = [
    {"role": "user", "content": "What is a mistral?"},
    {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
    {"role": "user", "content": "How does the mistral wind form?"},
]
data = {
    "messages": messages,
    "stream": True,
    "max_new_tokens": 512,
    "temperature": 0.9
}

# Call model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
    stream=True
)

# Print the generated tokens as they get streamed.
# Setting the encoding lets requests decode incrementally, so a multi-byte
# UTF-8 character split across chunks does not raise a decode error.
res.encoding = "utf-8"
for content in res.iter_content(chunk_size=None, decode_unicode=True):
    print(content, end="", flush=True)

JSON output

[
    "Mistral",
    "is",
    "a",
    "type",
    "..."
]

REST API Non-Streaming Example

If you don't want to stream the tokens, simply set the stream parameter to False.

The output is the entire text generated by the model.

Input

import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

messages = [
    {"role": "user", "content": "What is a mistral?"},
    {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
    {"role": "user", "content": "How does the mistral wind form?"},
]
data = {
    "messages": messages,
    "stream": False,
    "max_new_tokens": 512,
    "temperature": 0.9
}

# Call model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data
)

# Print the output of the model
print(res.json())

JSON output

{
    "output": "<s> [INST] What is a mistral? [/INST]A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour.</s>  [INST] How does the mistral wind form? [/INST]The mistral wind forms as a result of the orographic lifting of cold air over the southern slopes of the Alps. The cold air rises, cools, and condenses, forming clouds."
}
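
The sample output above echoes the whole conversation, including the [INST] and </s> prompt markers. If only the newest reply is needed, it can be cut out of the string. The sketch below is a rough illustration under that assumption: extract_last_reply is a hypothetical helper, the sample text is shortened for illustration, and the parsing relies on the output keeping the layout shown in the sample response.

```python
def extract_last_reply(output):
    """Return the text after the final [/INST] marker.

    Assumes the "<s> [INST] ... [/INST] reply" layout from the sample
    response; strings without the marker are returned stripped.
    """
    marker = "[/INST]"
    _, found, tail = output.rpartition(marker)
    reply = tail if found else output
    return reply.replace("</s>", "").strip()


# Shortened, illustrative stand-in for the dictionary returned by res.json()
sample = {
    "output": "<s> [INST] What is a mistral? [/INST]A cold, dry wind.</s>  "
              "[INST] How does it form? [/INST]Cold air is lifted over the Alps."
}
reply = extract_last_reply(sample["output"])
```

In the REST example, the same call would be extract_last_reply(res.json()["output"]), provided the response carries an output key as in the sample.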
