Playground v2 is a new text-to-image model by Playground that rivals Stable Diffusion XL (SDXL) in speed and quality. Trained from scratch, the Playground v2 model is similar to SDXL but has a more opinionated aesthetic, like that of a AAA video game. Which text-to-image model to use in your application depends on your use case and artistic preferences.
Playground v2 by Playground, released in December 2023, is a commercially-licensed text-to-image model with open weights. Like Stable Diffusion XL 1.0 (SDXL), it takes simple text prompts and creates high-quality images at a 1024x1024-pixel resolution.
In this article, we’ll compare output images head-to-head for quality, then discuss the speed and hardware requirements of each model. Here are the key results from my experimentation:
Visually, Playground v2 is the more consistent model. It has a specific style and creates similar compositions for repeated prompts. SDXL offers a wider variety of images: some better, some worse.
Inference speed is comparable between the models for a standard 50-step run, and most Stable Diffusion optimizations should be possible for Playground v2 given their similar architectures.
I have used SDXL extensively and I’m a big fan of the model — every header image you see on this blog was generated with SDXL. Well, that’s not quite true. The header image for this post was all Playground v2, and I expect I’ll be using it pretty often moving forward.
In their release announcement, Playground states that in testing across thousands of users, Playground v2 won head-to-head against SDXL output approximately 70 to 75 percent of the time, depending on the prompt set used. Evaluating model output is an evolving and imprecise science, especially for image generation, where people have different artistic preferences. While we will be looking at a tiny sample relative to Playground’s extensive testing, these examples should help you pick the right model for your use case.
Both SDXL and Playground v2 create 1024x1024 pixel images from simple prompts. In this section, I’ve compared the results for six prompts designed to show both the strengths and the weaknesses of each model. For each prompt, I ran both models three times and included all three output images.
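As a rough sketch of this setup, here is how you might run both models with Hugging Face’s diffusers library. The Hugging Face model ids and default settings below are my assumptions based on the public Hub listings, not details confirmed in this article:

```python
# Hypothetical sketch of the head-to-head setup using Hugging Face diffusers.
# Model ids are assumptions based on public Hugging Face Hub listings.
MODELS = {
    "playground-v2": "playgroundai/playground-v2-1024px-aesthetic",
    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
}

def generate(prompt: str, model_key: str, runs: int = 3, steps: int = 50):
    """Generate `runs` 1024x1024 images for one prompt. Requires a CUDA GPU."""
    from diffusers import DiffusionPipeline  # lazy import: heavy dependency
    import torch

    pipe = DiffusionPipeline.from_pretrained(
        MODELS[model_key], torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    return [
        pipe(prompt, num_inference_steps=steps).images[0] for _ in range(runs)
    ]
```

In practice you would also pass a seeded `generator` to each pipeline call to make repeated runs reproducible.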
This example shows Playground v2’s consistent framing and bright colors, as well as SDXL’s variety of styles and framings for the same prompt. To my eye, both are great: the Playground v2 images look like stills from a AAA video game, while the SDXL images look like oil paintings from a museum.
On a simple prompt where facial features and body lines are mostly covered by a hat, robe, and flowing beard, the two models have comparable performance on image details, both struggling occasionally with hands. The background remains bolder for Playground v2 than for SDXL by default, though with a more detailed prompt SDXL can make brighter images.
This is an intentionally difficult prompt. Image generation models have always struggled with hands and faces, though SDXL has improved with the introduction of the refiner model. And something as visually complex as a saxophone is hard for any model to render.
Overall, the model output is not bad. I personally like the middle SDXL output the best, but Playground v2 consistently produces high-quality images, though still with imperfect details given the difficult prompt.
Continuing the theme of difficult prompts, let’s try prompts that I don’t expect either model to do well on and see what happens.
Text-to-image models generally require concrete prompts to create quality output. Given something super abstract, such as the meaning of life, Playground v2 creates variations on a single composition, while SDXL produces far more varied, unpredictable output.
With an output of only 1024x1024 pixels, image generation models struggle to create highly detailed images. This prompt asks for detail in the floor, windows, and ceiling, which neither model can accomplish. Playground v2 attempts to maintain photorealism, while SDXL reverts to an oil painting style to capture an impression of the scene.
The final challenge for our models is a highly complicated prompt with multiple subjects, objects, and specified colors. While the models generally understand the relationships between two or three objects, they both have trouble orienting nearly a dozen objects in a scene. Of the two models, Playground v2 does a much better job of composing a realistic scene, though both models retain key elements (color scheme, three cats) while discarding many details (orange chair, position of cats).
On an A10G, both Playground v2 and SDXL took 12 to 15 seconds to generate an image with 50 inference steps without any optimizations.
We can get a lot faster than that: as fast as 2 seconds for SDXL with traditional inference or even faster with latent consistency models or SDXL Turbo. As Playground v2 is a much newer model than SDXL, these optimizations don’t exist yet, but when they do we can expect similar performance from Playground v2.
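As a sketch of what those optimizations might look like, the standard Stable Diffusion tricks (fp16 weights, compiling the UNet, fewer denoising steps) should carry over to Playground v2 given the shared architecture. Everything below is an untested assumption rather than a benchmarked recipe, and the per-step latency is a rough estimate derived from the ~13-second, 50-step baseline above:

```python
# Hypothetical speed-up sketch: common Stable Diffusion optimizations applied
# through diffusers. An assumption, not a benchmarked recipe.
def fast_pipeline(model_id: str):
    """Load a text-to-image pipeline with common speed-ups. Requires a CUDA GPU."""
    from diffusers import DiffusionPipeline  # lazy import: heavy dependency
    import torch

    pipe = DiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    # Compiling the UNet trades one-time warmup cost for faster steps.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
    return pipe

def estimated_latency_s(steps: int, sec_per_step: float = 0.27) -> float:
    """Rough linear latency model: ~13.5 s / 50 steps on an A10G (assumed)."""
    return steps * sec_per_step
```

Cutting steps is the bluntest lever: latency scales roughly linearly with step count, so dropping from 50 to 25 steps should roughly halve generation time, at some cost in image detail.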
The safetensors model weights files for Playground v2 and SDXL are both 6.94 GB at FP16. As such, each model can fit on GPUs with as little as 16 GB of VRAM, such as the T4, for hosted inference. However, the A10G and A100 provide faster inference.
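As a sanity check on that file size: FP16 stores two bytes per parameter, so a 6.94 GB checkpoint corresponds to roughly 3.5 billion parameters, consistent with an SDXL-class architecture. The parameter count here is my back-of-the-envelope estimate, not a figure from either model’s release notes:

```python
# Back-of-the-envelope check: FP16 weights take 2 bytes per parameter.
# The ~3.47B parameter count is an estimate inferred from the file size.
def fp16_weights_gb(num_params: float) -> float:
    """Return the approximate FP16 weights file size in GB."""
    return num_params * 2 / 1e9

print(fp16_weights_gb(3.47e9))  # 6.94 — matches the safetensors file size
```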
By default, Playground v2 and SDXL from the model library are deployed on A10G instances.
If you need consistent high-quality images and like Playground v2’s video game aesthetic, you can get started with the model today by deploying it in two clicks from the model library.
If you prefer the wider stylistic range of SDXL, we also have Stable Diffusion XL in our model library for immediate deployment. And if you want to speed up inference, check out our guide to speeding up SDXL image generation.