GPT vs Mistral: Migrate to open source LLMs seamlessly

TL;DR

If you’re using the ChatCompletions API and want to experiment with open source large language models (LLMs) for your generative AI application, we’ve built a bridge that lets you try out models like Mistral 7B with just three tiny code changes.

In the past few months, researchers have released powerful open source LLMs like Mistral 7B and Zephyr. These models demonstrate strong capabilities on tasks from chat completion to code generation, and many can be run on cost-effective hardware.

One barrier to adopting open source ML models is the time it takes to integrate a new model into your application: models have different input and output formats, support different parameters, and require different prompting strategies.

To make it easier to experiment with open source models, we’ve created a new endpoint for LLMs hosted on Baseten that’s compatible with OpenAI’s ChatCompletions API. With this endpoint and a supported model, you can go from GPT-3.5 to open source LLMs like Mistral 7B with:

  • One-click model deployment.

  • Zero pip install commands.

  • Three tiny code changes.

Follow along with the video above or the tutorial below and you’ll be working with open source LLMs in no time!

Deploy Mistral 7B

Our model library includes ready-to-deploy versions of Mistral 7B and Zephyr, two recent LLMs well suited to chat use cases.

Let’s deploy Mistral 7B for this tutorial:

  1. Select Mistral 7B Chat from the model library.

  2. Click “Deploy on Baseten”.

  3. Get your model ID by opening the “Call model” modal and copying it from the model endpoint.

  4. Create an API key for your Baseten workspace with the “Generate API key” button.

The Model ID in this example is "5wom4xnq"

You’ll need the API key and model ID to call the model endpoint in the next step.

Note that with open source models deployed on Baseten, you’re charged per minute of GPU usage rather than per token. Your Baseten account comes with free credits to fund experimentation, and all models deployed from the model library automatically take advantage of scale to zero with fast cold starts to save you money while the model is not in use.

Update your model usage script

Here’s the fun part of the project: you can drop this model into existing code that uses the ChatCompletions API with only minor changes.

In fact, the code will still use the OpenAI Python client, so you don’t need to install any new libraries. Let’s take a look at some code; we’ll cover the differences below.

Standard inference

Here’s a code sample for using OpenAI’s ChatCompletions API with GPT-3.5:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)

And here’s the same code sample for Mistral on Baseten:

from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url, replacing {model_id} with your deployed model's ID
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)

Rather than making you play spot-the-difference, we’ll highlight the three small code changes that make this work:

  1. Replace the OPENAI_API_KEY with your BASETEN_API_KEY in the client object.

  2. Set the base_url in the client object to https://bridge.baseten.co/{model_id}/v1 where {model_id} is the ID of your deployed model.

  3. In the client.chat.completions.create() call, set model to mistral-7b instead of gpt-3.5-turbo.
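Note that {model_id} in the sample above is a literal placeholder, so be sure to substitute your actual model ID. One way to wire it in is with a Python f-string; here’s a minimal sketch using the example model ID from the screenshot above (replace it with your own):

import os

from openai import OpenAI

model_id = "5wom4xnq"  # example ID from the screenshot; use your own model ID

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1"
)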

The response format will be exactly the same, though token usage values will not be calculated. The endpoint reference docs have complete information on supported inputs and outputs.
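Because the response mirrors OpenAI’s schema, any downstream parsing code you already have should keep working unchanged. As a quick sanity check (exact field values depend on the bridge, but these attributes are part of the standard schema):

# The response object follows the ChatCompletions schema.
print(response.choices[0].message.role)  # "assistant"
print(response.choices[0].message.content)
# usage exists for schema compatibility, but the bridge does not calculate token counts
print(response.usage)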

Streaming inference

LLMs on Baseten also support streaming. To stream responses from Mistral, pass stream=True in the ChatCompletions API call and parse the streaming response as needed:

from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url, replacing {model_id} with your deployed model's ID
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta)
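The delta on each chunk may carry no text (for example, the first chunk typically holds only the role), so a common pattern is to skip empty deltas and concatenate the rest. Here’s a minimal sketch that replaces the loop above and assembles the full reply:

# Print tokens as they arrive and accumulate them into the complete reply.
full_reply = ""
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:  # skip chunks with no text, e.g. the initial role-only delta
        print(content, end="", flush=True)
        full_reply += content
print()  # final newline once the stream ends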

Explore open source models

In addition to the LLMs listed in this piece — Mistral 7B Chat and Zephyr Chat — we're implementing OpenAI API compatibility across more models, like Mixtral 8x7B Chat. If there’s another LLM you’d like to try with the OpenAI client, you can adapt the Mistral model or let us know what you need at support@baseten.co. 

And while this bridge makes it easier to get started with LLMs like Mistral, there’s a wide world of open source models to explore. Our model library hosts dozens of curated open source models ranging from LLMs to models like Stable Diffusion XL and Whisper. Get started with our checklist for switching to open source or dive deeper with our guide to open source alternatives for ML models.