Sub-second image generation with Flux.2 and Qwen-Image

Image generation has become a core inference workload for creative tools, design workflows, marketing applications, and other AI-native products. Like video generation, it relies on an iterative denoising process in latent space that refines an image over multiple sampling steps. These workloads are highly latency-sensitive and usually run at very small batch sizes, so improving single-request latency directly improves user experience, throughput, and cost efficiency.
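To make the iterative denoising idea concrete, here is a toy Euler-style sampling loop in plain Python. It is only a sketch: `toy_velocity` is a hypothetical stand-in that already knows the target latent, whereas a real model predicts the update from the current latent, the timestep, and the prompt. It does not reflect how Flux.2 [dev] or Qwen-Image are implemented.

```python
import random


def toy_velocity(latent, noise, target):
    # Hypothetical stand-in for the denoising network. A real model would
    # predict this from (latent, timestep, prompt); here we cheat and use the
    # known target so the loop is easy to follow. `latent` is unused on purpose.
    return [n - t for n, t in zip(noise, target)]


def sample(target, num_inference_steps=8, seed=42):
    # A fixed seed makes the starting noise reproducible.
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    latent = list(noise)
    dt = 1.0 / num_inference_steps
    model_calls = 0
    for _ in range(num_inference_steps):
        velocity = toy_velocity(latent, noise, target)
        model_calls += 1
        # One Euler step: move the latent a fraction of the way toward the target.
        latent = [x - dt * v for x, v in zip(latent, velocity)]
    return latent, model_calls


target = [0.5, -1.0, 2.0, 0.0]
latent, calls = sample(target)
print(calls, [round(x, 6) for x in latent])
```

In this sketch each sampling step costs exactly one model call, so the number of steps multiplies whatever a single call costs. That is why per-step speedups translate so directly into lower request latency.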

In this post, we show how the Baseten Inference Stack accelerates image generation for Flux.2 [dev] and Qwen-Image on NVIDIA B200 and H100 GPUs through a set of runtime and serving optimizations. Flux.2 [dev] is an open-weight model from Black Forest Labs for high-quality text-to-image generation and image editing, designed for strong prompt-following and production deployments. Qwen-Image is a foundation image generation model developed by Qwen, known for complex text rendering, precise image editing, and multilingual text generation.

Inference latency for Flux.2 [dev] across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.98s, 5.4x faster than the H100 baseline and 2.3x faster than the B200 baseline.

Inference latency for Qwen-Image across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.87s, 4x faster than the H100 baseline and 1.6x faster than the B200 baseline.

Benchmarking methodology

To compare default and optimized image generation serving, we benchmarked each model under consistent single-request settings.

Futuristic scene generated by Flux.2 [dev]

Our benchmarking setup used:

Single-request latency
Single-GPU inference
B200 and H100 GPUs
1024x1024 image generation
n = 1 (one image generated per request)
num_inference_steps = 8
Fixed seed for reproducibility

Prompt lengths ranged from 32 to 4096 tokens. Prompt length did not meaningfully affect latency, so we report a single representative latency value per configuration.
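A single-request latency measurement along these lines could be sketched as follows. The warmup count, the number of timed runs, the use of the median, and the `fake_generate` placeholder are all assumptions made for illustration; the post does not describe the harness that produced the published numbers.

```python
import statistics
import time


def benchmark(generate, request, warmup=2, runs=5):
    # Warmup calls are discarded so one-time costs do not skew the timings.
    for _ in range(warmup):
        generate(request)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(request)  # one request at a time: single-request latency
        latencies.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "runs": runs,
    }


def fake_generate(request):
    # Placeholder for a real model call: sleeps briefly per inference step.
    time.sleep(0.001 * request["num_inference_steps"])


request = {
    "prompt": "a futuristic city at dusk",  # hypothetical prompt
    "n": 1,
    "size": "1024x1024",
    "num_inference_steps": 8,
    "seed": 42,
}
result = benchmark(fake_generate, request)
print(result)
```

Swapping `fake_generate` for a function that calls a deployed model would keep the rest of the harness unchanged.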

Overall, on B200 GPUs, optimized image generation runs up to 2.3x faster than the B200 baseline for Flux.2 [dev] and up to 1.6x faster for Qwen-Image.
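The speedup figures relate to latency by simple division: speedup = baseline latency / optimized latency. As a rough worked example, the snippet below back-computes the baseline latencies implied by the numbers quoted in this post. These are estimates only, because the published speedups are rounded, and the helper name `implied_baseline` is ours.

```python
def implied_baseline(optimized_s, speedup):
    # speedup = baseline / optimized, so baseline = optimized * speedup.
    return optimized_s * speedup


# Latencies and speedups quoted in this post (best result: FP4 on B200).
flux = {
    "h100_baseline_s": implied_baseline(0.98, 5.4),
    "b200_baseline_s": implied_baseline(0.98, 2.3),
}
qwen = {
    "h100_baseline_s": implied_baseline(0.87, 4.0),
    "b200_baseline_s": implied_baseline(0.87, 1.57),
}

for name, values in (("Flux.2 [dev]", flux), ("Qwen-Image", qwen)):
    for key, seconds in values.items():
        print(f"{name} {key}: about {seconds:.1f}s")
```

Under these assumptions the implied baselines are about 5.3s (H100) and 2.3s (B200) for Flux.2 [dev], and about 3.5s (H100) and 1.4s (B200) for Qwen-Image.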

Image generation inference optimization

Image generation is latency-sensitive: most production workloads run at low batch sizes because users expect interactive responses. Reducing single-request latency therefore improves both user experience and cost efficiency.

Within the Baseten Inference Stack, we optimized image generation serving with:

Hardware-aware quantization
- FP4 on B200
- FP8 on H100
Memory optimizations
Optimized attention kernels
Specialized element-wise kernels
Runtime-level serving improvements

For Flux.2 [dev], FP4 on B200 provides the strongest result, bringing latency below one second (0.98s). On H100, FP8 and memory optimizations eliminate the need for CPU offload and nearly halve latency.

For Qwen-Image, FP8 provides significant gains on both B200 and H100, while FP4 provides the strongest B200 result: 0.87s, a 1.57x speedup over the B200 baseline.
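To give a feel for what 4-bit floating-point quantization does to weights, here is a minimal pure-Python sketch of block-scaled rounding. It assumes the E2M1 layout, a common 4-bit float format whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The post does not say which FP4 format, block size, or scaling scheme the Baseten Inference Stack uses, so the block of 8 values and the max-based scale here are illustrative choices only.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    # Pick one scale per block so the largest value maps to the largest code.
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    scale = max_abs / E2M1_MAGNITUDES[-1]
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


weights = [0.02, -0.11, 0.45, -0.9, 0.3, 0.0, 0.07, -0.6]  # made-up values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(scale, codes, max_error)
```

Real kernels pack the codes into 4 bits and store the scales in a hardware-specific format; this sketch keeps everything as Python floats to show only the rounding behavior. Because the widest gap between neighboring magnitudes is 2 (between 4 and 6), the rounding error in this scheme is at most one scale unit per value.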

Image generation inference parameters

For these optimizations to be useful in production, they need to work across real image generation requests. The optimized serving stack supports common image generation parameters, including:

prompt: Text prompt describing the image
n: Number of images to generate
size: Output image dimensions, such as 1024x1024
response_format: Response format, such as b64_json
num_inference_steps: Number of diffusion steps
seed: Random seed for reproducibility
guidance_scale: Prompt adherence strength
negative_prompt: Text describing what to avoid
output_format: Output format, such as png, webp, jpeg, or jpg
background: Background handling, such as transparent, opaque, or auto

The benchmarks in this post used:

{
  "n": 1,
  "size": "1024x1024",
  "num_inference_steps": 8,
  "seed": 42
}
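As a sketch of how a client might assemble such a request, the helper below merges the benchmark settings with a prompt and optional overrides, then serializes the result to JSON. The function name `build_request`, the example prompt, and the validation rules are assumptions for illustration. In particular, treating the example values listed above for output_format and background as the full allowed sets is our simplification, and the endpoint URL and authentication are omitted because the post does not specify them.

```python
import json

# Settings used for the benchmarks in this post.
BENCHMARK_DEFAULTS = {"n": 1, "size": "1024x1024", "num_inference_steps": 8, "seed": 42}

# Assumption: the example values listed in the post are treated as the allowed sets.
ALLOWED_OUTPUT_FORMATS = {"png", "webp", "jpeg", "jpg"}
ALLOWED_BACKGROUNDS = {"transparent", "opaque", "auto"}


def build_request(prompt, **overrides):
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    # Later entries win, so overrides replace the benchmark defaults.
    payload = {"prompt": prompt, **BENCHMARK_DEFAULTS, **overrides}

    # size is expected in WIDTHxHEIGHT form, such as 1024x1024.
    width, _, height = payload["size"].partition("x")
    if not (width.isdigit() and height.isdigit()):
        raise ValueError(f"size must look like 1024x1024, got {payload['size']!r}")

    output_format = payload.get("output_format")
    if output_format is not None and output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    background = payload.get("background")
    if background is not None and background not in ALLOWED_BACKGROUNDS:
        raise ValueError(f"unsupported background: {background!r}")
    return payload


payload = build_request("a futuristic city at dusk", output_format="png")
body = json.dumps(payload)
print(body)
```

Validating on the client side like this catches malformed sizes or formats before a request reaches the GPU.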

Further optimizations

Futuristic scene generated by Qwen-Image

The results in this post focus on Flux.2 [dev] and Qwen-Image, but the same serving approach can support additional image generation workloads.

Baseten can currently support workloads for other models, such as:

Qwen-Image-Layered
Flux.2 [klein]
Other Flux and Qwen-Image variants

Future optimization work includes additional workload- and use-case-specific runtime tuning and further improvements to image generation latency on Blackwell and Hopper GPUs.

Image generation in production

Fast image generation only matters if it works reliably in production.

The Baseten Inference Stack is built for production workloads that need low latency, high reliability, and efficient GPU utilization. These optimizations make it easier to serve image generation models across both Hopper and Blackwell GPUs while reducing cost per request.

For teams building with Flux.2 [dev], Qwen-Image, Qwen-Image-Layered, Flux.2 [klein], or custom image generation models, Baseten’s model performance engineering team can deliver the same kinds of workload-specific optimizations in production. Reach out to talk to our engineers!
