Build with OpenAI’s Whisper model in five minutes

As soon as I saw Whisper, OpenAI’s open-source neural network for automatic speech recognition, I knew I had to start experimenting with it. The model isn’t an incremental improvement on speech-to-text; it’s a paradigm shift from “this technology could be cool one day” to “this technology has arrived.” Tested around the Baseten office, it captured not just English but Urdu, Mandarin, French, and more with stunning accuracy.

You can try Whisper for yourself with this demo app: Baseten Whisper demo

I’m a new grad software engineer, not an ML or infrastructure expert, so it was very satisfying to be able to deploy this impactful model and build an application on top of it. In this blog post, I’ll first show you how you can deploy Whisper instantly as a pre-trained model, then walk you through the steps I took to package and deploy the model myself.

Deploy Whisper instantly

If you’re as excited as I am about Whisper, you’ll want to start using it right away. That’s why we added Whisper to our pre-trained model library, so all Baseten users can deploy Whisper in seconds for free.

All you have to do is sign in to your Baseten account and follow the pre-trained model deployment instructions to build your own app powered by Whisper. Deploying the model takes just a few clicks, and you’ll be up and running almost instantly.

If you don’t have a Baseten account yet, you can sign up for free here.

To invoke the model, just pass in a dictionary with a URL pointing to an MP3 file, like this:

{
  "url": "https://cdn.baseten.co/docs/production/Gettysburg.mp3"
}
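
Under the hood, the deployed model is just an HTTPS endpoint, so you can call it from any language. Here’s a minimal sketch in Python; the endpoint URL and header format below are placeholders (as are the model ID and API key), so check your model’s page on Baseten for the exact invocation details.

import requests

# Placeholder values: substitute your deployed model's ID and your Baseten API key
MODEL_ID = "YOUR_MODEL_ID"
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    f"https://app.baseten.co/models/{MODEL_ID}/predict",  # hypothetical URL shape
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"url": "https://cdn.baseten.co/docs/production/Gettysburg.mp3"},
)
print(resp.json())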

That should be everything you need to get started building an application powered by Whisper. But if you’re interested in the mechanics of how I deployed this novel model, stick around for the rest of the writeup!

How I deployed Whisper

I used Truss, Baseten’s open-source model packaging and serving library, to deploy Whisper. You can see the packaged model in its entirety in this example Truss.

For a full walkthrough of the project, check out my YouTube video.

Whisper was created with PyTorch, one of Truss’ supported frameworks, but some of its dependencies were brand new. Fortunately, they were easy to add in my Truss’ configuration file.

requirements:
  - git+https://github.com/openai/whisper.git
  - --extra-index-url https://download.pytorch.org/whl/cu113
  - requests
system_packages:
  - ffmpeg

Another interesting challenge was working with GPUs to run the model. Whisper, like many large models, requires GPUs not only for training but also for invocation. On Baseten, running a model on a GPU is a paid feature enabled per model because of the cost of GPU compute, but in Truss, signaling that a model needs a GPU is as simple as setting a single flag.

resources:
  cpu: 500m
  memory: 3Gi
  use_gpu: true

But you’re not here for infrastructure; you’re here for awesome ML models. The heart of any model packaged as a Truss is the predict function in the model/model.py file. Let’s take a look:

def predict(self, request: Dict) -> Dict:
    # Write the raw audio bytes to a temporary file so Whisper can read them from disk
    with NamedTemporaryFile() as fp:
        fp.write(request["response"])
        result = whisper.transcribe(
            self._model,
            fp.name,
            temperature=0,
            best_of=5,
            beam_size=5,
        )
        # Keep just the start time, end time, and text of each transcribed segment
        segments = [
            {"start": r["start"], "end": r["end"], "text": r["text"]}
            for r in result["segments"]
        ]
    return {
        "language": whisper.tokenizer.LANGUAGES[result["language"]],
        "segments": segments,
        "text": result["text"],
    }
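
One thing predict takes for granted is self._model. In a Truss, that gets set up in the model’s load function, which runs once when the model server starts. The example Truss handles this for you, but conceptually it looks something like the sketch below (the “small” checkpoint is just an illustrative choice; Whisper ships several sizes).

def load(self):
    # Download the Whisper checkpoint once and keep it in memory for every request.
    # "small" is an assumption here; larger checkpoints trade speed for accuracy.
    self._model = whisper.load_model("small")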

You’ll notice that in predict, the model is invoked on a file path. Like most models that work with inputs more complicated than strings or numbers, audio in this case, Whisper relies on pre-processing to turn its input into something it can use. With Truss, pre- and post-processing functions are bundled with the model invocation code in the same file.

def preprocess(self, request: Dict) -> Dict:
    # Download the audio file from the provided URL and hand its bytes to predict
    resp = requests.get(request["url"])
    return {"response": resp.content}

def postprocess(self, request: Dict) -> Dict:
    # No post-processing needed; pass the transcription through unchanged
    return request
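
If you want to sanity-check this flow outside of Truss, you can reproduce it locally with just the whisper and requests packages (plus ffmpeg installed on your system). This is a rough sketch of the same download-then-transcribe logic, not the packaged model itself:

import requests
import whisper
from tempfile import NamedTemporaryFile

# Assumption: any Whisper checkpoint size works here; "small" keeps the download manageable
model = whisper.load_model("small")

# Download the audio, write it to a temp file, and transcribe from disk,
# mirroring the preprocess and predict steps above
audio = requests.get("https://cdn.baseten.co/docs/production/Gettysburg.mp3")
with NamedTemporaryFile(suffix=".mp3") as fp:
    fp.write(audio.content)
    fp.flush()  # make sure everything is on disk before Whisper reads the file
    result = whisper.transcribe(model, fp.name, temperature=0)

print(result["text"])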

After I put together the Truss, it was time to deploy. Getting the model onto Baseten was as simple as calling baseten.deploy(my_truss). From there, I enabled GPU access for the model and it was ready to go!
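
In code, the whole deploy step is only a few lines. The sketch below assumes you’ve cloned the example Truss into a local directory and grabbed an API key from your Baseten account; the exact arguments may differ slightly depending on your truss and baseten client versions.

import baseten
import truss

# Load the packaged model from its directory and push it to Baseten
my_truss = truss.load("./whisper-truss")   # placeholder path to the example Truss
baseten.login("YOUR_API_KEY")              # your Baseten API key
baseten.deploy(my_truss, model_name="Whisper")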

Building the demo app for Whisper was super easy. I just used Baseten’s drag-and-drop view builder to add a microphone component, a text box for output, and a few other UI elements. Then I created a button and wired it to a worklet that invokes the deployed model. It’s a simple app, and I look forward to experimenting further with Whisper. I want to see what you build with it too! Please send me any ideas or neat demos at support@baseten.co.