Stable Diffusion is an open-source image generation model developed by Stability AI. It goes image for image with DALL·E 2, but unlike DALL·E’s proprietary license, Stable Diffusion’s usage is governed by the CreativeML Open RAIL-M license. While this dramatically lowers the cost of using the model, getting it running still takes some technical aptitude, not to mention a high-end GPU.
I wanted to give Stable Diffusion a try, but didn’t want to send Nvidia a thousand dollars for a shiny new 12-gig 3080 Ti to run it on. Plus, I wanted my colleagues to be able to generate images too. So I set out to deploy the model on Baseten.
Spoiler alert: it worked. Here’s an app that you can use to interact with the deployed model.
If you’re curious about deploying a cutting-edge model on Baseten, read on and I’ll walk you through the process step by step.
While Stable Diffusion is a good deal simpler to run than many other big models, it still takes a few resources:
- If you’re deploying the model to Baseten, you’ll need both a Baseten API key and GPU access in your workspace. If your workspace uses a paid plan, contact us to get GPU access.
- If you want to serve the model locally or on another platform, you’ll need access to a machine with a CUDA-capable GPU.
- A Hugging Face access token with access to the Stable Diffusion model (once you have created the access token, visit the model’s page on Hugging Face and register to use the model; access is granted instantly).
- The diffusers and baseten packages installed from PyPI.
We don’t need to download the model weights directly or follow the build instructions from the Stable Diffusion GitHub repository, as we’ll get everything we need through the Hugging Face model.
Packaging the model
I used Truss, Baseten’s open source package for serving and deploying models, to prepare the model for production.
To create the Truss, I opened up Terminal and ran:
This created the folder structure that I used to package up the model. I edited two files from inside this folder: config.yaml and model/model.py.
In the config file, I made three adjustments. First I added package dependencies:
Then I made sure to configure the Truss to use a GPU. Stable Diffusion needs a GPU for inference, not just for training.
Finally, I used Truss’ secrets management feature to make sure that my model knows to look for the Hugging Face access token on Baseten.
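The three adjustments can be summarized in a short sketch. It is written in Python purely for illustration: the dictionary mirrors how the edited parts of config.yaml might look once parsed. The key names (requirements, resources.use_gpu, secrets), the package list, and the secret name hf_access_token are assumptions, not values confirmed in this post, and missing_adjustments is a hypothetical helper, not part of Truss.

```python
from typing import Any, Dict, List

# Assumed shape of the edited sections of config.yaml after parsing.
# Package names and the secret name are illustrative, not confirmed.
config: Dict[str, Any] = {
    "requirements": ["diffusers", "transformers", "torch"],  # 1. dependencies
    "resources": {"use_gpu": True},                          # 2. GPU for inference
    "secrets": {"hf_access_token": None},                    # 3. value is set on Baseten
}


def missing_adjustments(cfg: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list which of the three adjustments are absent."""
    problems: List[str] = []
    if not cfg.get("requirements"):
        problems.append("requirements")
    if not (cfg.get("resources") or {}).get("use_gpu"):
        problems.append("resources.use_gpu")
    if "hf_access_token" not in (cfg.get("secrets") or {}):
        problems.append("secrets.hf_access_token")
    return problems
```

Note that the secret's value stays empty in the config. Only the name is declared there, so the token itself never lands in the packaged model.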
Then it was time to work on the model code itself. Truss allows you to quickly define a model/model.py file and then turns that into a Docker image that contains a server that hosts your model. Here is the full model/model.py file, which I’ll explain in detail below.
Let's break that big chunk of code down.
In the init function, the device parameter lets the model access a GPU when available. The function also loads config information and secrets.
The load function loads the Stable Diffusion model. This uses your Hugging Face access token from the prerequisites section to access the weights. The model is then pushed to the GPU, when available.
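As a rough sketch of how the init and load functions described above might look, assuming Truss passes config and secrets as keyword arguments: the model ID, the secret name hf_access_token, and the _pick_device helper are assumptions for illustration. The use_auth_token argument matches older diffusers releases, while newer ones may expect token instead. The heavy imports are deferred so that the class can be defined on a machine without a GPU stack.

```python
from typing import Any, Dict, Optional


class Model:
    """Sketch of the setup half of a Truss-style model class."""

    # Assumed values; neither is confirmed by this post.
    MODEL_ID = "CompVis/stable-diffusion-v1-4"
    SECRET_NAME = "hf_access_token"

    def __init__(self, **kwargs: Any) -> None:
        # Config and secrets arrive as keyword arguments.
        self._config: Optional[Dict[str, Any]] = kwargs.get("config")
        self._secrets = kwargs.get("secrets") or {}
        self._device = self._pick_device()
        self._pipeline = None  # populated by load()

    @staticmethod
    def _pick_device() -> str:
        """Use the GPU when PyTorch can see one, otherwise fall back to CPU."""
        try:
            import torch
        except ImportError:
            return "cpu"
        return "cuda" if torch.cuda.is_available() else "cpu"

    def load(self) -> None:
        """Fetch the weights from Hugging Face and move them to the device."""
        from diffusers import StableDiffusionPipeline  # deferred heavy import

        pipeline = StableDiffusionPipeline.from_pretrained(
            self.MODEL_ID,
            use_auth_token=self._secrets[self.SECRET_NAME],
        )
        self._pipeline = pipeline.to(self._device)
```

Keeping the download in load rather than init means that constructing the object stays cheap, and the slow step happens once when the server starts.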
A helper function converts the Image object to a base64 string.
Finally, the predict function actually runs inference on the model. It parses the prompt out of the request, runs the prompt through the model, and does some post-processing on the resulting Image object. After running the result through the base64 helper function, it returns a response with the encoded image. Here's the line that actually invokes the model:
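A minimal sketch of that flow, written as standalone functions so it can run without a GPU. It assumes a request shaped like {"prompt": ...}, a pipeline whose output exposes an images list (as in recent diffusers releases), and a response with status, data, and message fields. The function names and the response shape are illustrative assumptions, not the exact code behind the deployed model.

```python
import base64
from io import BytesIO
from typing import Any, Callable, Dict


def image_to_base64(image: Any, image_format: str = "PNG") -> str:
    """Serialize an image (anything with a PIL-style save method) to base64."""
    buffer = BytesIO()
    image.save(buffer, format=image_format)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def predict(pipeline: Callable[[str], Any], request: Dict[str, Any]) -> Dict[str, Any]:
    """Run one prompt through the pipeline and return the encoded image."""
    prompt = request["prompt"]

    # The call that invokes the model: prompt in, generated images out.
    result = pipeline(prompt)

    # Post-processing: take the first generated image and encode it.
    image = result.images[0]
    return {"status": "success", "data": image_to_base64(image), "message": None}
```

Returning the image as a base64 string keeps the response JSON-serializable, at the cost of a payload roughly a third larger than the raw bytes.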
Stable Diffusion uses Hugging Face and PyTorch, which are both supported frameworks on Truss and Baseten. So it only takes a few lines of code to load and run inference on the model in production.
Serving and deployment
Before deploying the model, I served it locally to make sure everything was working as expected. I used the Truss library in a Jupyter notebook to invoke the model:
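However the model is invoked, the response carries the image as a base64 string, so checking the output means decoding it first. Here is a minimal sketch of that step, assuming the encoded image sits under a data key in the response. The key name and the save_response_image helper are assumptions for illustration.

```python
import base64
from pathlib import Path
from typing import Any, Dict


def save_response_image(response: Dict[str, Any], path: str) -> int:
    """Decode the base64 image in a response and write it to disk.

    Returns the number of bytes written.
    """
    encoded = response["data"]  # assumed response key
    image_bytes = base64.b64decode(encoded)
    return Path(path).write_bytes(image_bytes)
```

Opening the saved file in an image viewer, or displaying it inline in the notebook, is a quick way to confirm that the model produced a sensible picture before deploying.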
Satisfied that it was working, I deployed it to Baseten with just a couple lines of code:
With the model deployed, I used the application builder to add a simple user interface. The demo app takes a prompt and returns an image, letting anyone try Stable Diffusion without writing a line of code. Try the app and let your creativity run wild!