We optimized image generation serving for Flux.2 [dev] and Qwen-Image, delivering 2.3x and 1.6x speedups over SGLang on NVIDIA Blackwell GPUs, and 1.9x and 1.1x speedups on NVIDIA Hopper GPUs.
Image generation has become a core inference workload for creative tools, design workflows, marketing applications, and various AI-native products. Like video generation, it relies on an iterative denoising process in latent space, repeatedly refining an image over multiple sampling steps. These workloads are highly latency-sensitive and usually run at very small batch sizes, so improving single-request latency directly improves user experience, throughput, and cost efficiency.
In this post, we show how the Baseten Inference Stack accelerates image generation for Flux.2 [dev] and Qwen-Image on NVIDIA B200 and H100 GPUs through a set of runtime and serving optimizations. Flux.2 [dev] is an open-weight model from Black Forest Labs for high-quality text-to-image generation and image editing, designed for strong prompt-following and production deployments. Qwen-Image is a foundation image generation model developed by Qwen, known for complex text rendering, precise image editing, and multilingual text generation.
Inference latency for Flux.2 [dev] across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.98s, 5.4x faster than the H100 baseline and 2.3x faster than the B200 baseline.
Inference latency for Qwen-Image across hardware and optimization configurations. Baseten FP4 on B200 delivers the fastest result at 0.87s, 4x faster than the H100 baseline and 1.6x faster than the B200 baseline.
Benchmarking methodology
To compare default and optimized image generation serving, we benchmarked each model under consistent single-request settings.
![Futuristic scene generated by Flux.2 [dev]](/_next/image/?url=https%3A%2F%2Fwww.datocms-assets.com%2F104802%2F1779096569-genai1.png%3Fauto%3Dformat%26w%3D1200&w=3840&q=75)
Our benchmarking setup used:
Single-request latency
Single-GPU inference
B200 and H100 GPUs
1024x1024 image generation
n = 1 image generated
num_inference_steps = 8
Fixed seed for reproducibility
Prompt sequence lengths ranged from 32 to 4096 tokens. Prompt length did not meaningfully affect latency, so we report a single representative latency value per configuration.
Overall, optimized image generation runs up to 2.3x faster for Flux.2 [dev] and up to 1.6x faster for Qwen-Image on B200 GPUs.
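To make the methodology concrete, here is a minimal sketch of how single-request latency and speedup could be measured. It is not our benchmarking harness: the `measure_latency` and `speedup` helpers, the warm-up call, and the use of the median are all assumptions made for illustration, and `generate` stands in for whatever function issues one image generation request.

```python
import statistics
import time
from typing import Callable


def measure_latency(generate: Callable[[], object], warmup: int = 1, runs: int = 5) -> float:
    """Return the median wall-clock latency (seconds) of one request.

    `generate` is a hypothetical zero-argument callable that performs a
    single image generation request and blocks until it completes.
    """
    for _ in range(warmup):
        generate()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup is the baseline latency divided by the optimized latency."""
    if optimized_s <= 0:
        raise ValueError("optimized latency must be positive")
    return baseline_s / optimized_s


if __name__ == "__main__":
    # Stand-ins for real requests: sleeps simulate a slower and a faster server.
    baseline = measure_latency(lambda: time.sleep(0.02), runs=3)
    optimized = measure_latency(lambda: time.sleep(0.01), runs=3)
    print(f"speedup: {speedup(baseline, optimized):.2f}x")
```

With real endpoints, `generate` would send the same fixed-seed request each time, so that run-to-run differences reflect the serving stack and not the workload.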
Image generation inference optimization
Image generation is latency-sensitive. Most production image generation workloads run at low batch sizes because users expect interactive responses. As a result, reducing single-request latency improves both user experience and cost efficiency.
Within the Baseten Inference Stack, we optimized image generation serving with:
Hardware-aware quantization
FP4 on B200
FP8 on H100
Memory optimizations
Optimized attention kernels
Specialized element-wise kernels
Runtime-level serving improvements
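The hardware-aware part of the list above can be pictured as a simple lookup from GPU type to quantization format. The sketch below is purely illustrative: `PRECISION_BY_GPU` and `pick_precision` are hypothetical names, not part of the Baseten Inference Stack, and the mapping encodes only the two pairings named in this post.

```python
# Hypothetical mapping that mirrors the pairings described in this post.
PRECISION_BY_GPU = {
    "B200": "fp4",  # Blackwell: FP4
    "H100": "fp8",  # Hopper: FP8
}


def pick_precision(gpu_name: str) -> str:
    """Return the quantization format for a GPU name such as 'NVIDIA H100'."""
    name = gpu_name.upper()
    for gpu, precision in PRECISION_BY_GPU.items():
        if gpu in name:
            return precision
    # GPUs outside the two covered in this post are deliberately not guessed.
    raise ValueError(f"no precision configured for GPU: {gpu_name!r}")


if __name__ == "__main__":
    print(pick_precision("NVIDIA B200"))
    print(pick_precision("nvidia h100 80gb"))
```

Raising on unknown hardware, instead of silently falling back, keeps an unsupported configuration from being mistaken for an optimized one.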
For Flux.2 [dev], FP4 on B200 provides the strongest result, bringing latency below one second. On H100, FP8 and memory optimizations eliminate the need for CPU offload and nearly halve latency.
For Qwen-Image, FP8 provides significant gains on both B200 and H100, while FP4 provides the strongest B200 result with a 1.57x speedup over baseline.
Image generation inference parameters
For these optimizations to be useful in production, they need to work across real image generation requests. The optimized serving stack supports common image generation parameters, including:
prompt: Text prompt describing the image
n: Number of images to generate
size: Output image dimensions, such as 1024x1024
response_format: Response format, such as b64_json
num_inference_steps: Number of diffusion steps
seed: Random seed for reproducibility
guidance_scale: Prompt adherence strength
negative_prompt: Text describing what to avoid
output_format: Output format, such as png, webp, jpeg, or jpg
background: Background handling, such as transparent, opaque, or auto
The benchmarks in this post used:
{
"n": 1,
"size": "1024x1024",
"num_inference_steps": 8,
"seed": 42
}
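As a rough illustration, a client could assemble a request body like the one above with a small helper. `build_request` is a hypothetical function written for this post, not a Baseten client API. Its defaults copy the benchmark settings, while the validation rules are assumptions and the prompt is a placeholder.

```python
import json
from typing import Any, Dict, Optional

# Output formats listed in the parameter overview above.
ALLOWED_OUTPUT_FORMATS = {"png", "webp", "jpeg", "jpg"}


def build_request(
    prompt: str,
    n: int = 1,
    size: str = "1024x1024",
    num_inference_steps: int = 8,
    seed: int = 42,
    output_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable request body using the benchmark defaults."""
    width, sep, height = size.partition("x")
    if not (sep and width.isdigit() and height.isdigit()):
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    if n < 1 or num_inference_steps < 1:
        raise ValueError("n and num_inference_steps must be at least 1")
    body: Dict[str, Any] = {
        "prompt": prompt,
        "n": n,
        "size": size,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
    }
    if output_format is not None:
        if output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(f"unsupported output_format: {output_format!r}")
        body["output_format"] = output_format
    return body


if __name__ == "__main__":
    # Placeholder prompt; the benchmark prompts are not listed in this post.
    print(json.dumps(build_request("a futuristic city at dusk"), indent=2))
```

Optional parameters such as guidance_scale or negative_prompt could be added in the same way as output_format.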
Further optimizations

The results in this post focus on Flux.2 [dev] and Qwen-Image, but the same serving approach can support additional image generation workloads.
Baseten can currently support workloads for other models, such as:
Qwen-Image-Layered
Flux.2 [klein]
Other Flux and Qwen-Image variants
Future optimization work includes additional workload- and use-case-specific runtime tuning, and further improvements to image generation latency on Blackwell and Hopper GPUs.
Image generation in production
Fast image generation only matters if it works reliably in production.
The Baseten Inference Stack is built for production workloads that need low latency, high reliability, and efficient GPU utilization. These optimizations make it easier to serve image generation models across both Hopper and Blackwell GPUs while reducing cost per request.
For teams building with Flux.2 [dev], Qwen-Image, Qwen-Image-Layered, Flux.2 [klein], or custom image generation models, Baseten’s model performance engineering team can deliver the same kinds of workload-specific optimizations in production. Reach out to talk to our engineers!
