AudioGen, part of the AudioCraft family of models from Meta AI, is now available in the Baseten model library. This post gives a high-level overview of what AudioGen is, shows how to quickly deploy it from the Baseten model library, and shares some sample outputs.
The AudioCraft family of models from Meta AI includes AudioGen, MusicGen, and EnCodec, which together comprise Meta AI's latest state-of-the-art open source foundation models for text-to-audio generation. AudioGen was trained on publicly available sound effects and is capable of creating an incredible array of sounds from simple text inputs. That's a huge leap forward for text-to-audio generation, given how complex generating high-fidelity audio is.
Both AudioGen and MusicGen are currently available in the Baseten model library. You can deploy either (or both!) directly to Baseten by clicking the green button in the top right of the model page. There's no need to worry about figuring out which instance types you need; we've already selected the most efficient GPU for both models on Baseten (in this case, a single NVIDIA A10 GPU).
Once your model is deployed, you can run inference through either the Baseten client or curl. AudioGen takes a list of prompts and a duration in seconds as input; it generates one clip per prompt and returns each clip as a base64-encoded WAV file.
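As a rough sketch, invoking the deployed model and saving its output might look like the Python below. The endpoint URL, authorization header format, and the payload and response keys (`prompts`, `duration`, `data`) are assumptions for illustration, not the exact Baseten API; check your model page for the precise invocation details.

```python
# Hypothetical sketch: call a deployed AudioGen model over HTTP, then
# decode each base64-encoded WAV clip to a file on disk.
import base64
import requests


def generate_clips(prompts, duration, model_url, api_key):
    """POST prompts to the deployed model; return base64 WAV strings.

    model_url and the payload/response keys are assumptions — use the
    invocation details shown on your Baseten model page.
    """
    resp = requests.post(
        model_url,
        headers={"Authorization": f"Api-Key {api_key}"},
        json={"prompts": prompts, "duration": duration},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["data"]  # assumed key holding the list of clips


def save_clips(b64_wavs, prefix="clip"):
    """Decode base64-encoded WAV strings and write one .wav file each."""
    paths = []
    for i, b64_wav in enumerate(b64_wavs):
        path = f"{prefix}_{i}.wav"
        with open(path, "wb") as f:
            f.write(base64.b64decode(b64_wav))
        paths.append(path)
    return paths
```

With a deployed model, `save_clips(generate_clips(["small dog barking"], 8, model_url, api_key))` would write one WAV file per prompt to the working directory.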
We’ve started to play around with AudioGen and are impressed by the results! Below are a few of our favorites:
Prompt: footsteps on a wooden floor
Prompt: small dog barking
Prompt: man talking, emergency vehicle siren