Deploy Falcon-40B on Baseten

TL;DR

Falcon-40B is an impressive LLM that made it to the top of the OpenLLM leaderboard. It can be used commercially but requires serious hardware to run. You can deploy Falcon-40B on Baseten using this Truss model package.

Editor's note: If you're looking for an even more powerful midsize open source LLM, check out Mixtral 8x7B.

Falcon-40B is an LLM released by the Technology Innovation Institute (TII) in Abu Dhabi that’s been making waves in the AI community by reaching the top of the OpenLLM leaderboard. Published benchmarks put Falcon-40B in the same league as GPT-3.5, and anecdotally, its responses are very impressive.

We deployed Falcon-40B on two A100s on Baseten this week and were amazed. Here are a few example prompts and responses that impressed us:

Prompt 1: Recipe

Write a recipe for almond tea cake

Response:

Ingredients:

  • 1 1/2 cups unsalted butter

  • 1 1/2 cups granulated sugar

  • 1/2 cup packed light brown sugar

  • 3 large eggs

  • 2 1/4 cups almond flour

  • 1/2 cup all-purpose flour

  • 1/4 cup cornstarch

  • 2 teaspoons vanilla extract

  • 1/2 teaspoon almond extract

Instructions:

  1. Preheat oven to 350°F. Line a 9-inch springform pan with parchment paper.

  2. Beat butter, granulated sugar, brown sugar, and eggs in a large bowl with an electric mixer on medium speed until light and fluffy, about 5 minutes. Add almond flour, all-purpose flour, cornstarch, vanilla extract, and almond extract, and beat until just combined, about 1 minute.

  3. Pour batter into prepared pan and smooth the top with a spatula.

  4. Bake for 50 minutes to 1 hour, until a toothpick inserted in the center of the cake comes out clean. Let cool for 10 minutes, then remove the sides of the springform pan and cool cake on wire rack for 2 to 3 hours.

  5. Serve and enjoy!

Prompt 2: Fiction

Come up with a movie plot about Falcons scheming to take over the world

Response:

In the movie, Falcons are portrayed as highly intelligent and cunning creatures who have a hidden agenda of taking over the world. They use their natural abilities to plan and execute nefarious activities that are seemingly impossible to detect. From breaking into top-secret facilities and stealing sensitive military information to influencing politicians to change laws in their favor, they will stop at nothing to achieve their ultimate goal of world domination.

Prompt 3: Code

Write a Flask Endpoint for receiving a stripe webhook and printing it

Response:

To receive Stripe webhooks in Flask, you can use the Flask library to handle POST requests and print the data for logging or debugging purposes. Here's an example endpoint that listens to a certain endpoint URL and prints the received data:

from flask import Flask, request

# Initialize app
app = Flask(__name__)

# Define url that webhook should be sent to
WEBHOOK_ENDPOINT_URL = '/stripe/webhook'

@app.route(WEBHOOK_ENDPOINT_URL, methods=['POST'])
def webhook_handler():
    # Parse the received JSON payload
    payload = request.get_json()

    # Print the payload for debugging or logging
    print(payload)

    # Return a success response
    return 'OK'

# Run the app
if __name__ == '__main__':
    app.run()

Now when a Stripe webhook request is sent to your endpoint URL, it will be received and handled by your Flask endpoint, which will parse the payload and print it to the console. You can modify the endpoint to perform any additional processing or storage of the received data as per your needs.

Using Falcon-40B yourself

Unlike some of the other top open source LLMs, Falcon-40B can be used for commercial purposes. That’s easier said than done though. Say you want to use Falcon-40B in an application that you’re building. It’s tough! It’s a big model, so getting reasonable performance requires serious hardware (the elusive A100s). And using Falcon-40B in a real app means you need monitoring, easy scaling, logs, and other features that you’d generally need to run something in prod. 

To make that easier, we packaged the model with Truss, the open source package we use for deploying models, so you can now quickly deploy Falcon-40B on Baseten. 
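For context, a Truss model package is essentially a directory containing a config.yaml (hardware and dependency settings) and a model.py that implements a small load/predict interface. The actual Falcon-40B packaging lives in the repo linked below; the sketch here is only a simplified illustration of that interface, and the model checkpoint and generation settings shown are assumptions, not the exact code in the package.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class Model:
    def __init__(self, **kwargs):
        self._tokenizer = None
        self._model = None

    def load(self):
        # Runs once when the model server starts: load weights onto the GPU(s)
        self._tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")
        self._model = AutoModelForCausalLM.from_pretrained(
            "tiiuae/falcon-40b-instruct",  # assumed checkpoint for illustration
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

    def predict(self, model_input):
        # Runs on every request: generate a completion for the given prompt
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        output_ids = self._model.generate(**inputs, max_new_tokens=500)
        return {"output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)}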

Once you have a Baseten account, clone this GitHub repo and run the following to deploy Falcon-40B:

pip install --upgrade truss
git clone https://github.com/basetenlabs/truss-examples
cd truss-examples/falcon/falcon-40b-truss
truss push

Once you deploy Falcon-40B, you’ll get a REST API endpoint that you can use to get responses from the model. You’ll also get the full power of Baseten’s model serving infrastructure behind your model, including logging, monitoring, and autoscaling.
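As a rough illustration, here’s what calling the deployed model from Python might look like. The exact endpoint URL, authentication header, and input/output keys are shown on your model’s page in Baseten; the values below are placeholders and assumptions, not the definitive invocation.

import requests

# Placeholders: copy the real values from your Baseten model page and account settings
BASETEN_API_KEY = "YOUR_API_KEY"
MODEL_ENDPOINT = "https://app.baseten.co/models/YOUR_MODEL_ID/predict"  # assumed URL shape

# Send a prompt to the deployed Falcon-40B model and print the response
resp = requests.post(
    MODEL_ENDPOINT,
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json={"prompt": "Write a recipe for almond tea cake"},
)
resp.raise_for_status()
print(resp.json())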

Falcon-40B needs to run on two A100 GPUs, which can get expensive quickly. However, models like Mixtral 8x7B offer even better performance on a single A100, and 7-billion-parameter alternatives like Mistral 7B run on just one A10G.

Read more about Falcon-40B: