
Sub-second image generation with Flux.2 and Qwen-Image

2.3x faster Flux.2 [dev] and 1.6x faster Qwen-Image on NVIDIA Blackwell GPUs using quantization, optimized attention kernels, and runtime improvements.

TL;DR

We optimized image generation serving for Flux.2 [dev] and Qwen-Image, delivering 2.3x and 1.6x speedups over SGLang, respectively, on NVIDIA Blackwell GPUs, and 1.9x and 1.1x speedups on NVIDIA Hopper GPUs.

Image generation has become a core inference workload for creative tools, design workflows, marketing applications, and various AI-native products. Like video generation, it relies on an iterative denoising process in latent space, repeatedly refining an image over multiple sampling steps. These workloads are highly latency-sensitive and usually run at very small batch sizes, so improving single-request latency directly improves user experience, throughput, and cost efficiency. 

In this post, we show how the Baseten Inference Stack accelerates image generation for Flux.2 [dev] and Qwen-Image on NVIDIA B200 and H100 GPUs through a set of runtime and serving optimizations. Flux.2 [dev] is an open-weight model from Black Forest Labs for high-quality text-to-image generation and image editing, designed for strong prompt-following and production deployments. Qwen-Image is a foundation image generation model developed by Qwen, known for complex text rendering, precise image editing, and multilingual text generation.

Inference latency for Flux.2 [dev] across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.98s, 5.4x faster than the H100 baseline and 2.3x faster than the B200 baseline.
Inference latency for Qwen-Image across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.87s, 4x faster than the H100 baseline and 1.6x faster than the B200 baseline.
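
As a quick sanity check on the figures above, the reported speedups imply approximate baseline latencies. The sketch below only multiplies each optimized latency by its stated ratio. Because the ratios are rounded, the outputs are estimates rather than measured values, and the dictionary layout and function name are illustrative.

```python
# Optimized Baseten FP4 latency on B200 (seconds) and the speedups quoted
# in the figure captions. Nothing here is a new measurement.
reported = {
    "Flux.2 [dev]": {"optimized_s": 0.98, "vs_b200_baseline": 2.3, "vs_h100_baseline": 5.4},
    "Qwen-Image": {"optimized_s": 0.87, "vs_b200_baseline": 1.6, "vs_h100_baseline": 4.0},
}


def implied_baseline(optimized_s: float, speedup: float) -> float:
    """Baseline latency implied by speedup = baseline / optimized."""
    return optimized_s * speedup


# Estimated baselines, rounded to two decimals because the inputs are rounded.
implied = {
    model: {
        "b200_baseline_s": round(implied_baseline(r["optimized_s"], r["vs_b200_baseline"]), 2),
        "h100_baseline_s": round(implied_baseline(r["optimized_s"], r["vs_h100_baseline"]), 2),
    }
    for model, r in reported.items()
}

for model, estimate in implied.items():
    print(model, estimate)
```

For Flux.2 [dev], this puts the B200 baseline at roughly 2.25s and the H100 baseline at roughly 5.3s.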

Benchmarking methodology

To compare default and optimized image generation serving, we benchmarked each model under consistent single-request settings.

Futuristic scene generated by Flux.2 [dev]

Our benchmarking setup used:

  • Single-request latency

  • Single-GPU inference

  • B200 and H100 GPUs

  • 1024x1024 image generation

  • n = 1 images generated

  • num_inference_steps = 8

  • Fixed seed for reproducibility

Prompt sequence lengths ranged from 32 to 4096 tokens. Because prompt length did not meaningfully affect latency, we report a single representative latency value per configuration.
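
A single-request latency measurement along these lines can be sketched as follows. This is not the harness used for the results in this post: the `generate` callable, the warmup count, and the number of trials are all assumptions for illustration. The callable is expected to block until the image is fully returned.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_latency(
    generate: Callable[[], object],
    warmup: int = 2,
    trials: int = 10,
) -> Dict[str, float]:
    """Time sequential single requests and summarize wall-clock latency.

    `generate` is a hypothetical zero-argument callable that issues one
    image generation request (fixed prompt, seed, size, and step count)
    and returns only once the response is complete.
    """
    # Warmup runs absorb one-time costs such as kernel compilation and caching.
    for _ in range(warmup):
        generate()

    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def fake_generate() -> None:
    """Stand-in workload so the sketch runs without a GPU or an endpoint."""
    time.sleep(0.01)


print(benchmark_latency(fake_generate, warmup=1, trials=5))
```

Reporting the median rather than the mean keeps one slow outlier from skewing the representative value.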

Overall, optimized image generation runs up to 2.3x faster for Flux.2 [dev] and up to 1.6x faster for Qwen-Image on B200 GPUs.

Image generation inference optimization

Most production image generation workloads run at low batch sizes because users expect interactive responses, so reducing single-request latency improves both user experience and cost efficiency.
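
The link between latency and cost can be made concrete with a little arithmetic. A minimal sketch, assuming one GPU serves requests back to back at batch size 1 with no idle time. The hourly price is a made-up placeholder, not a Baseten or cloud price, and the baseline latency is an estimate derived from the rounded 2.3x figure.

```python
def images_per_gpu_hour(latency_s: float) -> float:
    """Images one GPU can produce per hour when requests run sequentially."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return 3600.0 / latency_s


def cost_per_image(latency_s: float, gpu_hourly_usd: float) -> float:
    """GPU cost attributed to one image under full, sequential utilization."""
    return gpu_hourly_usd / images_per_gpu_hour(latency_s)


# Placeholder price for illustration only.
HYPOTHETICAL_GPU_HOURLY_USD = 10.0

optimized_s = 0.98           # Flux.2 [dev], Baseten FP4 on B200 (from this post)
baseline_s = optimized_s * 2.3  # implied by the reported 2.3x speedup (estimate)

for label, latency in (("baseline", baseline_s), ("optimized", optimized_s)):
    rate = images_per_gpu_hour(latency)
    cost = cost_per_image(latency, HYPOTHETICAL_GPU_HOURLY_USD)
    print(f"{label}: {rate:.0f} images/hour, ${cost:.4f} per image")
```

Under these assumptions the cost per image scales linearly with latency, so the cost ratio equals the speedup whatever hourly price is plugged in.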

Within the Baseten Inference Stack, we optimized image generation serving with:

  • Hardware-aware quantization

    • FP4 on B200

    • FP8 on H100

  • Memory optimizations

  • Optimized attention kernels

  • Specialized element-wise kernels

  • Runtime-level serving improvements

For Flux.2 [dev], FP4 on B200 provides the strongest result, bringing latency below one second. On H100, FP8 and memory optimizations eliminate the need for CPU offload and nearly halve latency.

For Qwen-Image, FP8 provides significant gains on both B200 and H100, while FP4 provides the strongest B200 result with a 1.57x speedup over baseline (the 1.6x figure quoted elsewhere in this post, before rounding).

Image generation inference parameters

For these optimizations to be useful in production, they need to work across real image generation requests. The optimized serving stack supports common image generation parameters, including:

  • prompt: Text prompt describing the image

  • n: Number of images to generate

  • size: Output image dimensions, such as 1024x1024

  • response_format: Response format, such as b64_json

  • num_inference_steps: Number of diffusion steps

  • seed: Random seed for reproducibility

  • guidance_scale: Prompt adherence strength

  • negative_prompt: Text describing what to avoid

  • output_format: Output format, such as png, webp, jpeg, or jpg

  • background: Background handling, such as transparent, opaque, or auto

The benchmarks in this post used:

{
  "n": 1,
  "size": "1024x1024",
  "num_inference_steps": 8,
  "seed": 42
}
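
A request body that combines these benchmark settings with the other parameters listed above might be assembled as in the sketch below. The `build_request` helper and its validation are hypothetical and are not part of any Baseten API. The endpoint URL, authentication, and response shape are not covered in this post, so they are left out.

```python
import json
from typing import Any, Dict

# Settings used for the benchmarks in this post.
BENCHMARK_DEFAULTS: Dict[str, Any] = {
    "n": 1,
    "size": "1024x1024",
    "num_inference_steps": 8,
    "seed": 42,
}

# Parameter names listed in this post as supported by the serving stack.
SUPPORTED_PARAMS = {
    "prompt", "n", "size", "response_format", "num_inference_steps",
    "seed", "guidance_scale", "negative_prompt", "output_format", "background",
}


def build_request(prompt: str, **overrides: Any) -> Dict[str, Any]:
    """Merge a prompt and optional overrides onto the benchmark defaults."""
    unknown = set(overrides) - SUPPORTED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    body = {"prompt": prompt, **BENCHMARK_DEFAULTS, **overrides}

    # Illustrative client-side check that size looks like "WIDTHxHEIGHT".
    parts = str(body["size"]).split("x")
    if len(parts) != 2 or not all(p.isdigit() and int(p) > 0 for p in parts):
        raise ValueError(f"size must look like '1024x1024', got {body['size']!r}")
    return body


payload = build_request(
    "a futuristic city at dusk",
    response_format="b64_json",
    output_format="png",
)
print(json.dumps(payload, indent=2))
```

Keeping the benchmark settings as defaults makes it easy to reproduce the reported configuration while varying one parameter at a time.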

Further optimizations

Futuristic scene generated by Qwen-Image

The results in this post focus on Flux.2 [dev] and Qwen-Image, but the same serving approach can support additional image generation workloads.

Baseten can currently support workloads for other models, such as:

  • Qwen-Image-Layered

  • Flux.2 [klein]

  • Other Flux and Qwen-Image variants

Future optimization work includes additional workload- and use-case-specific runtime tuning, and further improvements to image generation latency on Blackwell and Hopper GPUs.

Image generation in production

Fast image generation only matters if it works reliably in production.

The Baseten Inference Stack is built for production workloads that need low latency, high reliability, and efficient GPU utilization. These optimizations make it easier to serve image generation models across both Hopper and Blackwell GPUs while reducing cost per request.

For teams building with Flux.2 [dev], Qwen-Image, Qwen-Image-Layered, Flux.2 [klein], or custom image generation models, Baseten’s model performance engineering team can deliver the same kinds of workload-specific optimizations in production. Reach out to talk to our engineers!
