Deploying Stable Diffusion on Baseten using Truss

Stable Diffusion is an open-source image generation model developed by Stability AI. It goes image for image with DALL·E 2, but while DALL·E 2 is proprietary, Stable Diffusion's usage is governed by the CreativeML Open RAIL-M License. That dramatically lowers the cost of using the model, but it still takes some technical aptitude to get it running, not to mention a high-end GPU.

I wanted to give Stable Diffusion a try, but I didn't want to send Nvidia a thousand dollars for a shiny new 12-gig 3080 Ti to run it on. Plus, I wanted my colleagues to be able to generate images too. So I set out to deploy the model on Baseten.

Spoiler alert: it worked. Here’s an app that you can use to interact with the deployed model.

If you're curious about deploying a cutting-edge model on Baseten, read on and I'll walk you through the process step by step.

Prerequisites

While Stable Diffusion is a good deal simpler to run than many other big models, it still takes a few resources:

  • If you’re deploying the model to Baseten, you’ll need both a Baseten API key and GPU access in your workspace. If your workspace uses a paid plan, contact us to get GPU access.

  • If you want to serve the model locally or on another platform, you'll need access to a machine with a CUDA-capable GPU.

  • A Hugging Face access token with access to the Stable Diffusion model (once you have created the access token, visit the model page and register to use it; access is granted instantly).

  • The truss, diffusers, and baseten packages from PyPI installed.

We don't need to download the model weights directly or otherwise follow the build instructions from the Stable Diffusion GitHub repository, as we'll get everything we need through the Hugging Face model.
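If you're following along, the Python dependencies can be installed from PyPI; something like this should cover everything used below (exact versions aside):

pip install truss baseten diffusers transformers torch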

Packaging the model

I used Truss, Baseten’s open source package for serving and deploying models, to prepare the model for production.

To create the Truss, I opened up Terminal and ran:

truss init ./stable-diffusion

This created the folder structure that I used to package up the model. I edited two files from inside this folder: config.yaml and model/model.py.
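The generated directory contains a handful of scaffolding files; the two I touched are laid out like this:

stable-diffusion/
├── config.yaml      # dependencies, resources, and secrets
└── model/
    └── model.py     # model loading and inference code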

In the config file, I made three adjustments. First I added package dependencies:

requirements:
- diffusers
- transformers
- torch

Then I made sure to configure the Truss to use a GPU. Stable Diffusion requires a GPU at inference time, not just during training, to generate images.

resources:
  cpu: 500m
  memory: 512Mi
  use_gpu: true

Finally, I used Truss’ secrets management feature to make sure that my model knows to look for the Hugging Face access token on Baseten.
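In config.yaml, that amounts to declaring the secret by name. In the version of Truss I used, the value here is just a placeholder; the real token is stored in your Baseten workspace:

secrets:
  hf_access_token: null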

Then it was time to work on the model code itself. Truss lets you quickly define a model/model.py file, then turns it into a Docker image containing a server that hosts your model. Here is the full model/model.py file, which I'll explain in detail below.

import torch
from torch import autocast
import base64
from io import BytesIO
from typing import Dict, List
from diffusers import StableDiffusionPipeline


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs.get("secrets")
        self._model = None
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    def load(self):
        self._model = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            revision="fp16",
            torch_dtype=torch.float16,
            use_auth_token=self._secrets["hf_access_token"],
        )
        # push model onto gpu where possible
        self._model = self._model.to(self.device)

    # helper function to convert to b64
    def convert_to_b64(self, image):
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
        return img_b64

    def predict(self, request: Dict) -> Dict[str, List]:
        print(self.device)
        response = {}
        response["predictions"] = []
        inputs = request["inputs"]
        prompts = list(map(lambda x: x["prompt"], inputs))
        # run inference over our prompts and pull out the resulting image
        results = []
        with autocast(self.device.type):
            for prompt in prompts:
                image = self._model(prompt)["sample"][0]
                results.append(image)
        # convert images to b64
        b64_results = list(map(lambda x: self.convert_to_b64(x), results))
        response["predictions"] = b64_results
        return response

Let's break that big chunk of code down.

In the __init__ function, the device attribute lets the model use a GPU when one is available. The function also stores the config information and secrets that Truss passes in.

self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

The load function loads the Stable Diffusion model. It uses your Hugging Face access token from the prerequisites section to access the weights. The model is then pushed to the GPU when available.

def load(self):
    self._model = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        revision="fp16",
        torch_dtype=torch.float16,
        use_auth_token=self._secrets["hf_access_token"],
    )
    # push model onto gpu where possible
    self._model = self._model.to(self.device)

A helper function converts the Image object to a base64 string.

def convert_to_b64(self, image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_b64
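On the client side, decoding that string back into an image file is just the reverse operation. Here's a minimal sketch; the save_b64_image helper and output path are my own, not part of the Truss:

import base64

def save_b64_image(b64_string, path="output.jpg"):
    # decode one entry of response["predictions"] and write it to disk as a JPEG
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_string))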

Finally, the predict function actually runs inference on the model. It parses the prompts out of the request, runs them through the model, and does some post-processing on the resulting Image objects. After running each result through the base64 helper function, it returns a response with the encoded images. Here's the line that actually invokes the model:

image = self._model(prompt)["sample"][0]
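To make the input and output formats concrete, here is what a call looks like, based on how predict reads and builds its dictionaries:

# a request carries one or more prompt objects under "inputs"
example_request = {"inputs": [{"prompt": "man on moon"}]}

# the response carries one base64-encoded JPEG string per prompt:
# {"predictions": ["<base64-encoded JPEG string>"]}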

Stable Diffusion uses Hugging Face and PyTorch, which are both supported frameworks on Truss and Baseten. So it only takes a few lines of code to load and run inference on the model in production.

"A robot artist, cartoon" as generated by Stable Diffusion on Baseten

Serving and deployment

Before deploying the model, I served it locally to make sure everything was working as expected. I used the Truss library in a Jupyter notebook to invoke the model:

import truss
scaffold = truss.from_directory("./stable-diffusion")
scaffold.server_predict({"inputs" : [{"prompt" : "man on moon"}]})

Satisfied that it was working, I deployed it to Baseten with just a couple lines of code:

import baseten
import truss

scaffold = truss.from_directory("./stable-diffusion")
baseten.login("paste your Baseten API key")
baseten.deploy_truss(scaffold, model_name='stable_diffusion')
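Once the deployment finished, the model could be called from anywhere with a Baseten API key. As a rough sketch using the Baseten Python client (the version ID below is a placeholder, and the exact call may differ depending on your client version):

import baseten

# placeholder: find your model version ID in the Baseten dashboard
model = baseten.deployed_model_version_id("MODEL_VERSION_ID")
model.predict({"inputs": [{"prompt": "man on moon"}]})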

With the model deployed, I used the Baseten application builder to add a simple user interface. The demo app takes a prompt and returns an image, letting anyone try Stable Diffusion without writing a line of code. Try the app and let your creativity run wild!

