Stable Diffusion XL is the most popular open-source image model in 2026. ComfyUI has passed 4M downloads, Automatic1111 still has its loyal user base, and every other flavour exists too: Forge, Fooocus, InvokeAI, SwarmUI. They all run SDXL. The question for most users isn't "which UI" but "will my PC run it well?"
This article gives you calibrated answers. We'll cover the hardware test you can run in 15 seconds (no install), then break down expected SDXL performance by GPU tier with real seconds-per-image numbers from public ComfyUI benchmarks.
The 15-second hardware test (in your browser)
9bench.com runs a hardware probe via WebGPU + WebAssembly + Web Workers. Open the page, click Start, wait 15 seconds. Result: your CPU/GPU/RAM scores plus an AI Capabilities section that predicts SDXL feasibility on your hardware.
What it actually measures and how it predicts SDXL:
- GPU detection via WEBGL_debug_renderer_info — extracts the actual GPU model
- FP16 support check via WebGPU shader-f16 feature — required for SDXL native speed (FP32 fallback is 2× slower)
- VRAM probe — checks max allocatable buffer size, infers usable VRAM
- GPU class lookup — matches your detected GPU against a curated table of 50+ entries with calibrated SDXL times sourced from ComfyUI benchmarks, TechPowerUp, public Hugging Face Spaces measurements
- Result — predicts seconds per 1024×1024 image (low/high range based on sampler choice)
This isn't a deep-learning benchmark — we don't actually run SDXL in your browser (that would take minutes). It's a calibrated lookup based on your detected hardware, and it's honest about being a prediction, not a measurement.
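In sketch form, the prediction step is just a table lookup plus an FP16 penalty. The entries and helper below are illustrative (the values mirror the tier tables later in this article), not 9bench's actual code:

```python
# Illustrative lookup, not 9bench's real table (which has 50+ calibrated rows).
SDXL_SECONDS_PER_IMAGE = {
    # detected-name substring: (low, high) seconds per 1024x1024 image
    "RTX 4090": (3, 5),
    "RTX 4070": (9, 14),
    "RTX 3060": (20, 32),
    "M3 Max": (10, 17),
}

def predict_sdxl_time(detected_gpu: str, fp16: bool) -> tuple[int, int] | None:
    """Return a (low, high) seconds-per-image estimate, or None if unknown."""
    for name, (low, high) in SDXL_SECONDS_PER_IMAGE.items():
        # Naive substring match; a real table needs careful model disambiguation.
        if name.lower() in detected_gpu.lower():
            # Without shader-f16 the pipeline falls back to FP32: ~2x slower.
            return (low, high) if fp16 else (low * 2, high * 2)
    return None

print(predict_sdxl_time("NVIDIA GeForce RTX 4070", fp16=True))  # (9, 14)
```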
Tier-by-tier SDXL performance (1024×1024, 30 steps + refiner)
Numbers below are median seconds per image for stock SDXL (no LoRA, no ControlNet) generating a 1024×1024 image with 30 steps base + 10 steps refiner. Sampler: DPM++ 2M Karras. Sources: ComfyUI public benchmarks, Civitai user reports, Tom's Hardware AI workload tests. Tier boundaries are approximate: where a card's range overlaps two tiers, placement also weighs VRAM, since VRAM decides whether batching and the refiner stay usable.
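For reference, here is roughly what that benchmark setup looks like as a Hugging Face Diffusers script. This is a sketch, not the exact harness behind the numbers: the checkpoints are the official Stability AI releases, and the 0.75 denoising handoff approximates the 30 + 10 step split.

```python
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    StableDiffusionXLImg2ImgPipeline,
    DPMSolverMultistepScheduler,
)

device = "cuda"  # "mps" on Apple Silicon
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to(device)
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to(device)

# DPM++ 2M Karras, the sampler behind the numbers above.
for pipe in (base, refiner):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, use_karras_sigmas=True
    )

prompt = "a lighthouse on a cliff at dawn, volumetric light"
# Base runs the first 75% of 40 steps (~30), the refiner the last 25% (~10).
latents = base(
    prompt, height=1024, width=1024,
    num_inference_steps=40, denoising_end=0.75,
    output_type="latent",
).images
image = refiner(
    prompt, image=latents,
    num_inference_steps=40, denoising_start=0.75,
).images[0]
image.save("sdxl_benchmark.png")
```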
Tier 1: Beast (under 6 seconds per image)
| GPU | VRAM | SDXL 1024² (sec) | Batch 4 feasible? |
|---|---|---|---|
| RTX 5090 | 32 GB | 2-4 | Yes (batch 8+) |
| RTX 4090 | 24 GB | 3-5 | Yes (batch 4-6) |
| RX 7900 XTX | 24 GB | 4-7 | Yes (batch 4) |
| RTX 5080 | 16 GB | 3-5 | Yes (batch 2-3) |
| RTX 4080 Super | 16 GB | 4-7 | Yes (batch 2-3) |
| Apple M3 Ultra | up to 192 GB unified | 5-9 | Yes (memory-rich) |
Tier 1 is "make it as fast as it can be". Suitable for: Civitai-style mass image creation, SDXL-Turbo experimentation at 60+ images/minute, ComfyUI animation workflows, training LoRAs locally.
Tier 2: Workstation (6-15 seconds per image)
| GPU | VRAM | SDXL 1024² (sec) | Batch 2 feasible? |
|---|---|---|---|
| RTX 5070 Ti | 16 GB | 4-7 | Yes |
| RTX 4070 Ti Super | 16 GB | 6-10 | Yes |
| RTX 3090 (used) | 24 GB | 7-11 | Yes |
| RX 7900 XT | 20 GB | 5-8 | Yes |
| RTX 4070 Super | 12 GB | 8-13 | Yes (tight) |
| RX 7800 XT | 16 GB | 8-12 | Yes |
| Apple M3 Max | up to 64 GB unified | 10-17 | Yes |
| RTX 4070 | 12 GB | 9-14 | Yes (tight) |
Tier 2 is the practical creator tier. Generate 4-10 images per minute. Batch generation works. LoRAs and ControlNet add ~30-50% overhead. Refiner stays enabled.
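For a concrete sense of where that overhead comes from, here is a minimal Diffusers sketch of an SDXL + ControlNet + LoRA stack. The Canny ControlNet checkpoint is a real public release; the LoRA repo name is a placeholder for whatever style LoRA you use.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# SDXL Canny ControlNet published on the Hugging Face Hub.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet, torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Placeholder repo name: any SDXL-compatible style LoRA works here.
pipe.load_lora_weights("your-username/your-sdxl-lora")

canny = Image.new("RGB", (1024, 1024))  # stand-in; use a real Canny edge map
image = pipe(
    "a glass skyscraper at dusk",
    image=canny,                        # the extra ControlNet forward pass
    controlnet_conditioning_scale=0.5,  # is where the ~30-50% overhead goes
    num_inference_steps=30,
).images[0]
```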
Tier 3: Mainstream (15-30 seconds per image)
| GPU | VRAM | SDXL 1024² (sec) | Refiner advised? |
|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | 12-20 | Yes |
| RTX 3080 | 10 GB | 9-14 | Yes (tight VRAM) |
| RTX 3070 Ti | 8 GB | 12-19 | Disable for batch |
| RX 7700 XT | 12 GB | 10-15 | Yes |
| RTX 4060 Laptop | 8 GB | 18-30 | Disable for batch |
| RTX 4060 | 8 GB | 14-22 | Disable for batch |
| RX 6700 XT | 12 GB | 22-35 | Yes |
| Apple M3 Pro | 18-36 GB unified | 14-24 | Yes |
| RTX 3060 12GB | 12 GB | 20-32 | Yes |
Tier 3 is "it works but be patient". Generate 2-4 images per minute. Use SDXL-Turbo or LCM LoRA for fast iteration; switch to full 30-step DPM++ for finals. 8 GB VRAM cards work but require --medvram and disabling the refiner for stable batch generation.
Tier 4: Working (30-90 seconds per image)
| GPU | VRAM | SDXL 1024² (sec) | Tips |
|---|---|---|---|
| GTX 1080 Ti | 11 GB | 30-50 | Use SDXL-Turbo (8 steps), no refiner |
| RTX 2060 / 2070 | 6-8 GB | 35-70 | --medvram, smaller resolution first |
| Apple M2 / M2 Pro | 16+ GB unified | 30-50 | Use Diffusers with MPS backend |
| Apple M1 Max | 32+ GB unified | 18-30 | Better than people expect |
| RX 6600 / 6650 XT | 8 GB | 28-45 | ROCm or Vulkan path; --medvram |
| Apple M1 Pro | 16+ GB unified | 25-40 | Diffusers MPS |
Tier 4 is "you can do it but switch to SDXL-Turbo or LCM-LoRA for usable iteration". Generate 1-2 images per minute on full 30-step. Generate 6-15 images per minute on Turbo (8 steps, no refiner). Most users on this tier should default to Turbo workflows.
Tier 5: Patient (90+ seconds, or skip SDXL for SD 1.5)
On Intel Iris Xe, AMD Radeon 680M, low-end APUs, or ancient discrete GPUs (GTX 1060 6 GB, RX 580): SDXL is technically possible but punishing. Better path:
- Use Stable Diffusion 1.5 instead — generates 512×512 images in 5-15s on the same hardware. Quality is lower but practical.
- Use SDXL-Turbo at 1 step — yes, just 1 step. Quality drops noticeably but 90s images become 15s images.
- Use the cloud — Hugging Face Spaces, Replicate, or RunPod give you 5-second SDXL for $0.001-0.01 per image, often cheaper than the electricity for hours of local generation.
The 8 GB VRAM trap
SDXL was designed for 12+ GB VRAM but the community has built escape hatches. If you have 8 GB:
- Automatic1111 / Forge: add --medvram-sdxl to your launch flags. ~30% slower but stable.
- ComfyUI: launch with --lowvram, or use the Tile VAE node + sequential CPU offload.
- Disable the refiner: the full SDXL pipeline is base + refiner. The refiner adds quality but doubles the VRAM peak. Skip it on 8 GB cards.
- Smaller initial resolution: generate at 832×832, upscale via Hires Fix to 1.25×. Final image is similar to 1024×1024 but VRAM peak is lower.
- SDXL-Turbo or LCM: 8 steps instead of 30, no refiner. Half the VRAM peak.
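If you script with Diffusers instead of using a UI, the rough equivalents of those switches are one-liners. A minimal sketch:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
)

# Rough analogues of the UI flags above. Note: no .to("cuda") here;
# offloading manages device placement itself.
pipe.enable_model_cpu_offload()  # ~ --medvram: stream submodules to the GPU on demand
pipe.enable_vae_tiling()         # ~ Tile VAE: decode the image in tiles
pipe.enable_vae_slicing()        # decode batched images one at a time

image = pipe(
    "a watercolor fox",
    height=832, width=832,       # smaller first pass, as suggested above
    num_inference_steps=30,
).images[0]
image.save("sdxl_8gb.png")
```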
All of these work. None are as comfortable as having 12+ GB. If you're shopping for a GPU primarily for SDXL: spend the extra $100-150 for a 12 GB+ card. RTX 3060 12GB used at $200 is the price/perf champion for budget SDXL.
Apple Silicon for SDXL: better than the reputation suggests
Common misconception: "Apple is bad for image generation". Reality is more nuanced.
What's true: on a per-image basis, NVIDIA wins. M3 Max takes 10-17s vs RTX 4090's 3-5s. The 4090 is roughly 3× faster.
What's not true: that Apple is unusably slow. M3 Pro at 14-24s/image is comparable to an RTX 3060 12GB at 20-32s. M3 Max at 10-17s/image beats the RTX 4060 Ti and RTX 3070 Ti. M3 Ultra at 5-9s/image beats everything except the RTX 4080 Super and up.
Apple specifics:
- Use Diffusers with the MPS backend (the Hugging Face Diffusers library) — the fastest route on Apple Silicon, roughly 2× faster than the Core ML route (see the sketch after this list)
- Use the Draw Things app (free, App Store) — the easiest UI, well-optimised for Apple Silicon
- Avoid Automatic1111 directly — it works, but is slower than Diffusers/Draw Things on a Mac
- Unified memory advantage: not useful for SDXL (the model fits in 24 GB anyway). It's an LLM win, not an SDXL win.
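A minimal Diffusers-on-MPS sketch, assuming a recent PyTorch build with MPS support:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("mps")                      # Metal Performance Shaders backend

pipe.enable_attention_slicing()  # lowers peak memory on unified-memory Macs

image = pipe(
    "a foggy harbor at sunrise",
    height=1024, width=1024, num_inference_steps=30,
).images[0]
image.save("sdxl_mps.png")
```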
SDXL in the browser (yes, this works in 2026)
Browser-based SDXL via WebGPU is now a real option. Hugging Face Spaces hosts dozens of SDXL/Turbo demos. ONNX Runtime Web supports SDXL inference. Browser performance:
| Hardware | Native SDXL (s) | Browser SDXL (s) | Slowdown |
|---|---|---|---|
| RTX 4090 | 3-5 | 20-40 | ~6-8× |
| RTX 4070 | 9-14 | 50-90 | ~6× |
| M3 Max | 10-17 | 40-80 | ~4-5× |
| RTX 3060 12GB | 20-32 | 90-150 | ~5× |
The slowdown ratio is similar to browser vs native LLM inference: roughly 4-8× depending on hardware. Browser SDXL works well for demos, "try this without installing anything" pages, and quick triage. It's not a replacement for native ComfyUI workflows.
What 9bench specifically tells you about SDXL
Run 9bench.com, scroll to AI Capabilities. You'll see:
- SDXL feasibility verdict: "Easy" / "Comfortable" / "Memory-saver mode required" / "Painful" / "Don't bother"
- Calibrated time-per-image range: e.g. "8-13 seconds per 1024×1024 image" — sourced from your detected GPU's class
- Browser vs Native breakdown: what the same hardware would do in ComfyUI native vs browser-only WebGPU
- VRAM verdict: enough for refiner, enough for batch, enough for LoRA training
- Recommended workflow: SDXL-Turbo vs full 30-step, refiner on or off, --medvram or not
All this in the same 15-second test that benchmarks your CPU/GPU/RAM. Free, no install, no signup. Open methodology — every number traceable to a public benchmark source.
Common questions
"Can I run SDXL on my GTX 1060 6GB?" Technically yes with --lowvram, but expect 60-120s/image and frequent crashes. SD 1.5 is the better fit for a 1060 6GB.
"Should I buy a used RTX 3090 for SDXL?" Yes — best price/perf VRAM at $700-900. 24 GB lets you run massive ControlNet stacks, train LoRAs, batch-generate without OOM. Faster than a new RTX 4070 for SDXL despite being 4 years old.
"Is SDXL faster on Linux or Windows?" Linux is 5-10% faster on NVIDIA (better CUDA driver overhead) and notably faster on AMD (ROCm 6.0+ matures Linux first). For most users the difference isn't worth dual-booting.
"Will SDXL ever run faster on my hardware?" Yes, every 6 months. SDXL optimisations keep landing — TensorRT for NVIDIA cuts 30%, ROCm Composable Kernels for AMD cut 20%, LCM-LoRA cuts steps by 4-8×. A GPU bought today will be ~2× faster in 2 years on the same SDXL workload.
"What about Stable Diffusion 3 / Flux / SDXL Lightning?" Flux.1 (Black Forest Labs) is heavier than SDXL — needs 12 GB minimum, 16 GB+ comfortable, ~50% slower per image. SD3 Medium is similar to SDXL. SDXL Lightning is much faster (1-step) but lower quality. The hardware tiers above scale similarly: if your tier handles SDXL, it handles SD3 and most Flux quants. Flux fp16 is the exception — that's an RTX 4080+ workload.
Test your PC for SDXL — 15 seconds
9bench detects your GPU, checks FP16 / VRAM / WebGPU support, and predicts SDXL generation time on your specific hardware. Browser-only, no install, no signup. Calibrated against ComfyUI / Automatic1111 public benchmarks.
Test my PC for SDXL →