Best GPU for AI Image Generation 2026: Stable Diffusion, Flux, and SDXL VRAM Guide

Written by Mitrasish, Co-founder | May 24, 2026

This guide covers VRAM requirements, step-time benchmarks, and cost per image across every GPU tier from RTX 4090 to H200, for SD 1.5 through Flux.2. For interactive setup, see the ComfyUI GPU cloud setup guide, which covers Docker installation and workflow configuration. This post focuses on the GPU selection decision itself: which GPU to pick, at what cost, and why. For a broader GPU cloud buying guide covering provider selection, pricing transparency, and right-sizing across workloads, see the AI GPU buyers guide.

All prices are fetched from the Spheron live marketplace. Benchmark figures represent community-established data points for standard ComfyUI and diffusers workloads.

TL;DR: GPU Decision Matrix

Model compatibility by GPU tier:

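| GPU | VRAM | SD 1.5 | SDXL | Flux.1 Dev (FP8) | Flux.2 (FP8) | SD3.5 Large | Best For |
|---|---|---|---|---|---|---|---|
| RTX 4090 | 24GB | ✓ | ✓ | ✓ (tight) | ~ (GGUF Q4 only) | ~ | Hobby, SDXL, Flux.1 FP8 |
| L40S | 48GB | ✓ | ✓ | ✓ (BF16 too) | ✓ | ✓ | High-res batch, Flux.1 BF16 |
| H100 PCIe | 80GB | ✓ | ✓ | ✓ | ✓ | ✓ | Production studio, video models |
| H100 SXM5 | 80GB | ✓ | ✓ | ✓ | ✓ | ✓ | Maximum throughput, video diffusion |
| H200 SXM5 | 141GB | ✓ | ✓ | ✓ | ✓ (BF16 too) | ✓ | Multi-model, 4K Flux, video |
| RTX PRO 6000 | 96GB | ✓ | ✓ | ✓ | ✓ | ✓ | Production studio, full text encoder in VRAM |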

Legend: ✓ = fits comfortably, ~ = fits with caveats or quantization required, X = does not fit.

Use-case to GPU mapping:

| Use Case | Recommended GPU | Est. $/hr |
|---|---|---|
| Hobby single-image (SDXL, Flux.1 FP8) | RTX 4090 on-demand | $0.67 |
| Production batch SDXL | L40S on-demand | $0.75 |
| Production Flux.1 BF16 or heavy ControlNet | L40S on-demand | $0.75 |
| Flux.2 FP8 production | H100 PCIe on-demand | $2.09 |
| Maximum throughput batch | H100 SXM5 spot | $0.80 |
| Video diffusion (Wan 2.1, HunyuanVideo) | H100 SXM5 or H200 | $3.90+ |
| Multi-model serving or 4K Flux | H200 SXM5 or RTX PRO 6000 | $0.59-$4.56 |

Prices are as of 24 May 2026 and may fluctuate. Check current GPU pricing for live rates.

VRAM Requirements by Model and Precision

The table below lists how much VRAM each model needs. All figures assume a single model loaded with no ControlNets, no LoRAs, and batch size 1. For Flux.2, the text encoder (Mistral Small 3.1, 24B) is assumed to be CPU-offloaded unless explicitly noted. Keeping the text encoder in VRAM adds approximately 12-13GB for FP8 or 24-26GB for BF16.

| Model | FP16/BF16 | FP8 | INT8 | Notes |
|---|---|---|---|---|
| SD 1.5 base | ~2GB | ~1.5GB | ~1.5GB | Fits on nearly any GPU |
| SDXL base (no refiner) | ~8GB | ~6GB | ~5GB | 12GB+ comfortable for ControlNet |
| SDXL base + refiner | ~16GB | ~12GB | ~10GB | 24GB covers stacking + LoRAs |
| Flux.1 Schnell | ~18GB | ~12GB | N/A | 4-step generation, FP8 standard |
| Flux.1 Dev | ~30-33GB | ~18-23GB | N/A | 28-50 steps, FP8 for RTX 4090 |
| Flux.2-dev (32B, text enc. CPU) | ~64GB | ~32GB | N/A | BF16 needs H200 or RTX PRO 6000 |
| Flux.2-dev (32B, full, in VRAM) | ~90GB | ~44-45GB | N/A | Requires 80GB+ even in FP8 |
| Flux.2-dev GGUF Q8_0 | N/A | N/A | ~35GB | Text enc. CPU-offloaded via llama.cpp |
| Flux.2-dev GGUF Q4_K_S | N/A | N/A | ~19GB | RTX 4090 compatible, moderate quality loss |
| SD3.5 Large | ~34GB | ~24GB | ~20GB | Transformer backbone |
| SD3.5 Medium | ~16GB | ~12GB | ~10GB | More accessible, similar architecture |
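
As a rough way to turn the VRAM table into a GPU choice, the sketch below picks the lowest-priced on-demand GPU whose VRAM covers a given requirement. The VRAM sizes and prices are the ones quoted in this guide; the function name and the `headroom_gb` parameter are illustrative assumptions, and the default of zero headroom treats the table figures as total requirements.

```python
# VRAM sizes and on-demand prices as quoted in this guide (24 May 2026).
GPUS = [
    # (name, vram_gb, on_demand_usd_per_hr)
    ("RTX 4090", 24, 0.67),
    ("L40S", 48, 0.75),
    ("RTX PRO 6000", 96, 1.77),
    ("H100 PCIe", 80, 2.09),
    ("H100 SXM5", 80, 3.90),
    ("H200 SXM5", 141, 4.56),
]


def cheapest_gpu_that_fits(model_vram_gb, headroom_gb=0.0):
    """Return the lowest-priced GPU tuple that covers model + headroom, or None.

    headroom_gb is a hypothetical safety margin for LoRAs, ControlNets,
    or larger batches; it is not a figure from the table.
    """
    needed = model_vram_gb + headroom_gb
    candidates = [gpu for gpu in GPUS if gpu[1] >= needed]
    if not candidates:
        return None
    return min(candidates, key=lambda gpu: gpu[2])


# Upper-end VRAM figures from the table above.
for label, vram in [
    ("Flux.1 Dev FP8", 23),
    ("Flux.1 Dev BF16", 33),
    ("Flux.2-dev BF16, full, in VRAM", 90),
]:
    choice = cheapest_gpu_that_fits(vram)
    print(label, "->", choice[0] if choice else "no single GPU fits")
```

This ranks on on-demand price only. Spot pricing and throughput change the answer for batch work, as the cost tables later in the guide show.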

LoRA stacking VRAM formula

Each LoRA adapter adds roughly 100-500MB depending on rank (rank-8 LoRA is smaller, rank-128 is larger). For SDXL at BF16 on a 24GB card:

base model VRAM (8GB) + sum(LoRA sizes) + activations overhead (2-4GB) + ControlNet (3-6GB each)

Four rank-16 LoRAs at ~200MB each = 800MB. Two ControlNets at 5GB each = 10GB. SDXL BF16 total: 8 + 0.8 + 10 + 3 = 21.8GB. That fits on 24GB, but just barely. Adding a fifth LoRA or a third ControlNet risks OOM. On a 48GB L40S, the same setup has 26GB of headroom.

At FP8 precision, SDXL base drops to ~6GB and LoRAs remain the same size, giving more stacking room on 24GB cards.
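
The stacking arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the same sum, not a measurement: the function names are illustrative, LoRA sizes are converted at 1,000MB per GB to match the 800MB = 0.8GB figure used above, and the 3GB activations value is the mid-range of the 2-4GB estimate.

```python
def stack_vram_gb(base_gb, lora_mb, controlnet_gb, activations_gb):
    """Sum the terms of the stacking formula. LoRA sizes are given in MB."""
    return base_gb + sum(lora_mb) / 1000 + sum(controlnet_gb) + activations_gb


def headroom_gb(card_gb, used_gb):
    """VRAM left on the card after the stack is loaded."""
    return card_gb - used_gb


# SDXL BF16 base, four rank-16 LoRAs (~200MB each), two ControlNets (5GB each).
total = stack_vram_gb(
    base_gb=8,
    lora_mb=[200] * 4,
    controlnet_gb=[5, 5],
    activations_gb=3,
)

print(f"stack total:      {total:.1f} GB")
print(f"headroom on 24GB: {headroom_gb(24, total):.1f} GB")
print(f"headroom on 48GB: {headroom_gb(48, total):.1f} GB")
```

Swapping `base_gb` to 6 models the FP8 case and shows how much extra stacking room the quantized base leaves on a 24GB card.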

RTX 4090: 24GB for SDXL and Flux.1 FP8

The RTX 4090 is the cost-efficient entry point for cloud GPU image generation at $0.67/hr on-demand. Its 24GB GDDR6X VRAM and 1,008 GB/s bandwidth handle SD 1.5, SDXL, and Flux.1 Dev FP8 without issue.

Diffusion model inference is memory-bandwidth-bound: each denoising step reads the full weight tensor from VRAM. The RTX 4090's 1,008 GB/s GDDR6X bandwidth sets its throughput ceiling for these workloads.

Step-time comparison (SDXL, 20 steps, 1024x1024):

| GPU | img/min | Sec/image | On-Demand $/hr | Cost/100 images |
|---|---|---|---|---|
| RTX 4090 | ~28 | ~2.1s | $0.67 | ~$0.040 |

Step-time comparison (Flux.1 Dev FP8, 28 steps):

| GPU | img/min | Sec/image | On-Demand $/hr | Cost/100 images |
|---|---|---|---|---|
| RTX 4090 | ~13 | ~4.6s | $0.67 | ~$0.086 |

At 24GB VRAM, the RTX 4090 hits hard limits. Flux.1 Dev BF16 (30-33GB) does not fit. Flux.2-dev FP8 (32GB for backbone with text encoder CPU-offloaded) does not fit. Only GGUF Q4_K_S variants (~19GB) of Flux.2-dev work on the RTX 4090, but with moderate quality loss versus FP8. For these models, step up to L40S (48GB) or H100 PCIe (80GB).

The case for renting an RTX 4090: If your workflow is SDXL or Flux.1 Dev FP8 and cost is the primary constraint, rent an RTX 4090 at $0.67/hr for the lowest cost per image at that VRAM tier.

L40S: 48GB for High-Res and Batch Generation

The L40S sits between the RTX 4090 consumer tier and the datacenter tier (H100). Its 48GB GDDR6 covers every diffusion model through Flux.2-dev FP8 (with text encoder on CPU), with room for ControlNet stacks and LoRA collections.

Why 48GB matters:

  • Flux.1 Dev BF16 (30-33GB) fits with 15-18GB to spare. The RTX 4090 at 24GB cannot run BF16 at all.
  • SDXL base + refiner + 4 ControlNets + 4 LoRAs fits within 40GB at BF16, something impossible on 24GB cards.
  • 2K upscaling pipelines that chain SDXL, an upscaler, and Real-ESRGAN simultaneously use 35-42GB total.
  • Overnight batch queues can keep the model resident in VRAM across all jobs with no reload overhead.

Step-time comparison (SDXL, 20 steps, 1024x1024):

| GPU | img/min | Sec/image | On-Demand $/hr | Cost/100 images |
|---|---|---|---|---|
| RTX 4090 | ~28 | ~2.1s | $0.67 | ~$0.040 |
| L40S | ~24 | ~2.5s | $0.75 | ~$0.052 |

The L40S is slightly slower than the RTX 4090 for single-image SDXL: GDDR6 at 864 GB/s vs GDDR6X at 1,008 GB/s on the RTX 4090. Its cost per image is about 30% higher ($0.052 vs $0.040 per 100 images), but for complex multi-ControlNet setups the L40S holds all assets in VRAM simultaneously when a 24GB card cannot.

Batch generation advantage: When processing a queue of 500 images overnight, keeping the model in VRAM between jobs eliminates per-job reload time (typically 8-15 seconds per SDXL model load). On the L40S, you load once and run. On a 24GB card, complex setups may require reloading between jobs to free VRAM. If every job forces a reload, that adds roughly 67-125 minutes over 500 images, which the L40S avoids entirely.

Step-time comparison (Flux.1 Dev, 28 steps):

| GPU | Precision | img/min | On-Demand $/hr | Cost/100 images |
|---|---|---|---|---|
| RTX 4090 | FP8 | ~13 | $0.67 | ~$0.086 |
| L40S | FP8 | ~11 | $0.75 | ~$0.114 |
| L40S | BF16 | ~8 | $0.75 | ~$0.156 |

The L40S is slightly slower than the RTX 4090 for Flux.1 FP8 per image and costs marginally more per hour. Its key advantage is running Flux.1 BF16 without quantization, which the RTX 4090 cannot do at all (24GB is insufficient for BF16). If your use case requires full BF16 precision for Flux.1, L40S GPU rental is the cheapest tier that handles it.

H100: When 80GB Changes What's Possible

The jump from 24-48GB to 80GB opens workloads that simply do not run on consumer or prosumer cards.

Flux.2-dev FP8 with text encoder in VRAM: The full setup (backbone ~32GB + text encoder FP8 ~13GB) needs 44-45GB. H100 PCIe at 80GB handles this comfortably with 35GB spare for activations and batch headroom. The L40S at 48GB fits this configuration with only a few GB of margin, while the RTX 4090 at 24GB cannot run it at all.

Video diffusion models: Wan 2.1 (14B, BF16) requires 60-70GB at 720p with model offloading enabled. FP8 reduces this to roughly 40-50GB. HunyuanVideo requires a minimum of 60GB at 720p (80GB recommended). Both are completely out of reach on anything below H100 PCIe.

Multi-model serving: Two Flux.1 Dev FP8 instances fit in 80GB simultaneously (2x 23GB = 46GB, leaving room for activations). Flux.1 plus SDXL base also fits. For API servers handling multiple concurrent generation requests without swapping models, H100 PCIe is the minimum practical option.

4K image generation: Tiled diffusion at 4K resolution with large overlap buffers consumes 50-60GB of VRAM for the intermediate activation tensors. H100 PCIe covers this; anything smaller requires aggressive tiling with visible seams.

H100 SXM vs PCIe for diffusion workloads:

| Metric | H100 PCIe | H100 SXM5 |
|---|---|---|
| VRAM | 80GB HBM2e | 80GB HBM3 |
| Bandwidth | 2,000 GB/s | 3,350 GB/s |
| On-demand $/hr | $2.09 | $3.90 |
| Spot $/hr | N/A | $0.80 |
| SDXL img/min | ~42 | ~65 |
| Flux.1 Dev FP8 img/min | ~26 | ~38 |

The H100 SXM5's HBM3 at 3,350 GB/s offers about 67% more bandwidth than the PCIe's HBM2e at 2,000 GB/s. In the benchmarks above, that translates to roughly 45-55% more images per minute (65 vs 42 for SDXL, 38 vs 26 for Flux.1 Dev FP8). At on-demand rates, the H100 PCIe at $2.09/hr is cheaper per hour than the H100 SXM5 at $3.90/hr. But at $0.80/hr spot, the H100 SXM5 delivers the lowest cost per image of any GPU in the 80GB tier.

For teams running sustained batch jobs at scale, H100 SXM5 spot rental on Spheron is the most cost-efficient option in the datacenter tier. For interactive or low-volume production that needs 80GB VRAM without spot risk, H100 PCIe on-demand at $2.09/hr works well.

For the Flux.2 production deployment side, the Flux.2 production deployment guide covers ComfyUI and diffusers setup, Docker configuration, and A100 vs H100 PCIe throughput numbers for FP8. For a direct benchmark and pricing comparison between those two generations, see the A100 vs H100 guide.

SDXL cost-per-image (H100 SXM5 spot vs H100 PCIe):

| Config | img/min | Cost/100 images |
|---|---|---|
| H100 SXM5 spot ($0.80/hr) | ~65 | ~$0.021 |
| H100 PCIe on-demand ($2.09/hr) | ~42 | ~$0.083 |

The spot price makes H100 SXM5 the most cost-efficient SDXL option in the entire lineup, approximately 4x cheaper per 100 images than H100 PCIe on-demand, at the cost of potential preemption.

H200: 141GB HBM3e for Video Diffusion and 4K Flux

The H200's 141GB HBM3e at 4,800 GB/s bandwidth addresses a narrow set of use cases where H100 falls short.

When to choose H200 over H100: The extra 61GB matters in three specific scenarios:

  1. Full-res HunyuanVideo at 1080p (requires ~100-120GB VRAM for the largest configurations). H100 at 80GB forces aggressive model offloading that cuts throughput by 40-60%.
  2. 4K tiled Flux.2-dev at full precision with large tile overlaps, where intermediate activation buffers reach 80-100GB at maximum quality settings.
  3. Multi-model setups with three or more models resident simultaneously (e.g., Flux.2-dev FP8 + Flux.1 Schnell + SD3.5 Medium all loaded and ready for routing without reload).

For standard Flux.2-dev FP8 with text encoder CPU-offloaded, the H200 adds no practical advantage over H100 PCIe. The 80GB on H100 PCIe already covers the ~32GB FP8 backbone with about 48GB of headroom, and still leaves 35GB with the FP8 text encoder held in VRAM. The H200's value shows up in the edge cases above.

Current Spheron pricing: H200 SXM5 at $4.56/hr on-demand, with spot pricing available at $2.00/hr. For H200 GPU rental for video diffusion workflows, the on-demand rate is the more predictable option when jobs cannot be interrupted.

RTX PRO 6000 Blackwell: 96GB for Production Studios

The RTX PRO 6000 Blackwell occupies an unusual position: 96GB of GDDR7 on a workstation-class GPU at $1.77/hr on-demand ($0.59/hr spot), cheaper than H100 PCIe at $2.09/hr while offering 16GB more VRAM.

What 96GB unlocks:

  • Flux.2-dev FP8 backbone (32GB) + full Mistral Small 3.1 text encoder in FP8 (13GB) = 45GB. 51GB of headroom remains for activations, multiple LoRAs, and batch processing. No text encoder CPU-offloading needed.
  • Flux.2-dev BF16 backbone (64GB) + text encoder BF16 (26GB) = 90GB. Tight but fits.
  • Flux.1 Dev BF16 (30GB) simultaneously with an SDXL full pipeline (16GB) = 46GB. Multi-model routing without model swapping.
  • Entire LoRA collections for a single model loaded resident in VRAM at once.

RTX PRO 6000 vs H100 PCIe for image generation:

| Metric | RTX PRO 6000 | H100 PCIe |
|---|---|---|
| VRAM | 96GB GDDR7 | 80GB HBM2e |
| Bandwidth | 1,792 GB/s | 2,000 GB/s |
| On-demand $/hr | $1.77 | $2.09 |
| Spot $/hr | $0.59 | N/A |
| Flux.1 Dev FP8 img/min | ~23 | ~26 |
| Flux.2 FP8 (text enc in VRAM) | Yes | Yes |
| ECC memory | Yes | Yes |

The H100 PCIe's HBM2e has ~12% more memory bandwidth than the RTX PRO 6000's GDDR7 (2,000 vs 1,792 GB/s). In the Flux.1 Dev FP8 benchmark above, the H100 PCIe is ~13% faster (26 vs 23 img/min). The RTX PRO 6000 costs ~15% less per hour on-demand and offers 16GB more VRAM (96GB vs 80GB).

For workloads that require more than 80GB VRAM (Flux.2-dev BF16 with the full text encoder in VRAM, large multi-model setups), the RTX PRO 6000 rental offers more capacity at a lower on-demand rate than H100 PCIe. For pure SDXL throughput, the H100 PCIe actually costs less per 100 images because of its higher throughput. The RTX PRO 6000 spot at $0.59/hr is where the value is sharpest: $0.043/100 Flux.1 images with all 96GB available.

The RTX PRO 6000's blower cooler exhausts heat directly out the back of the chassis, suitable for rack-mounted studio deployments.

Cost-Per-Image: Consumer vs Prosumer vs Datacenter

The table below uses benchmark figures for two workflows: SDXL base 1024x1024 at 20 steps, and Flux.1 Dev FP8 at 28 steps. For third-party providers, prices shown are approximate based on published public rates at time of writing and may differ from current rates.

| GPU | Provider | $/hr | SDXL img/hr | Flux.1 Dev FP8 img/hr | Cost/100 SDXL | Cost/100 Flux.1 |
|---|---|---|---|---|---|---|
| RTX 4090 | Spheron on-demand | $0.67 | ~1,680 | ~780 | ~$0.040 | ~$0.086 |
| L40S | Spheron on-demand | $0.75 | ~1,440 | ~660 | ~$0.052 | ~$0.114 |
| RTX PRO 6000 | Spheron on-demand | $1.77 | ~1,380 | ~1,380 | ~$0.128 | ~$0.128 |
| RTX PRO 6000 | Spheron spot | $0.59 | ~1,380 | ~1,380 | ~$0.043 | ~$0.043 |
| H100 PCIe | Spheron on-demand | $2.09 | ~2,520 | ~1,560 | ~$0.083 | ~$0.134 |
| H100 SXM5 | Spheron on-demand | $3.90 | ~3,900 | ~2,280 | ~$0.100 | ~$0.171 |
| H100 SXM5 | Spheron spot | $0.80 | ~3,900 | ~2,280 | ~$0.021 | ~$0.035 |
| H100 PCIe | RunPod on-demand | ~$2.89 | ~2,520 | ~1,560 | ~$0.115 | ~$0.185 |
| H100 PCIe | Lambda on-demand | ~$2.49 | ~2,520 | ~1,560 | ~$0.099 | ~$0.160 |

Pricing fluctuates based on GPU availability. The prices above are based on 24 May 2026 and may have changed. Check current GPU pricing for live rates.
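
The cost columns are the hourly rate divided by images per hour, times 100. A minimal sketch of that calculation follows, using the Spheron SDXL rows from the table above; the function and variable names are illustrative, and the throughput figures are the same approximate benchmarks, so the output is only as reliable as they are.

```python
def cost_per_100_images(usd_per_hr, images_per_hr):
    """Dollars to generate 100 images at a steady throughput."""
    return usd_per_hr / images_per_hr * 100


# (label, $/hr, SDXL img/hr) copied from the Spheron rows of the table above.
SDXL_ROWS = [
    ("RTX 4090 on-demand", 0.67, 1680),
    ("L40S on-demand", 0.75, 1440),
    ("RTX PRO 6000 spot", 0.59, 1380),
    ("H100 PCIe on-demand", 2.09, 2520),
    ("H100 SXM5 on-demand", 3.90, 3900),
    ("H100 SXM5 spot", 0.80, 3900),
]

# Rank configurations from cheapest to most expensive per 100 images.
ranked = sorted(
    ((label, cost_per_100_images(rate, ips)) for label, rate, ips in SDXL_ROWS),
    key=lambda row: row[1],
)

for label, cost in ranked:
    print(f"{label:<22} ${cost:.3f} per 100 SDXL images")
```

Replacing the throughput column with your own measured images per hour gives a ranking for your actual workflow rather than the reference benchmark.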

Key takeaways from the table:

H100 SXM5 spot at $0.80/hr delivers the lowest cost per 100 images in the entire lineup for both SDXL ($0.021) and Flux.1 ($0.035). The trade-off is spot preemption risk.

RTX 4090 on-demand at $0.040/100 SDXL images is 2x cheaper than H100 PCIe on-demand ($0.083/100). For SDXL-only workflows, the case for H100 PCIe is weak unless you need its 80GB VRAM for video models.

RTX PRO 6000 spot at $0.59/hr delivers Flux.1 Dev FP8 at $0.043/100 images, offering more VRAM and no text encoder CPU-offload requirement at a fraction of H100 PCIe's cost.

RunPod and Lambda list H100 PCIe at roughly $2.89/hr and $2.49/hr respectively, making their per-image costs about 1.2x-1.4x Spheron's on-demand rate for the same hardware.

Deploying on Spheron: Sample Workflow

On-demand for interactive ComfyUI: Rent an RTX 4090 at $0.67/hr. Pull the ComfyUI Docker image, run via docker run -d --gpus all --ipc=host -p 127.0.0.1:8188:8188, access via SSH tunnel. Generate your session's images. Shut the instance down when done. For a 3-hour session at $0.67/hr: $2.01 total.

Spot for overnight batch SDXL: Provision an H100 SXM5 spot instance at $0.80/hr. Submit your 5,000-image batch job. At 65 img/min on H100 SXM5, 5,000 images take ~77 minutes, costing approximately $1.03. Compare: an RTX 4090 on-demand would take ~179 minutes at $0.67/hr, or $1.99 total. The H100 SXM5 spot is both faster and cheaper.
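
The batch comparison above is a two-line calculation, sketched here so it can be rerun with other batch sizes or rates. The function name is illustrative, and the sketch assumes steady throughput with no preemptions or model reloads.

```python
def batch_job(images, img_per_min, usd_per_hr):
    """Return (minutes, dollars) for a batch at steady throughput."""
    minutes = images / img_per_min
    dollars = minutes / 60 * usd_per_hr
    return minutes, dollars


# Throughput and rates are the SDXL figures quoted in this guide.
h100_min, h100_usd = batch_job(5000, img_per_min=65, usd_per_hr=0.80)
rtx_min, rtx_usd = batch_job(5000, img_per_min=28, usd_per_hr=0.67)

print(f"H100 SXM5 spot:     {h100_min:.0f} min, ${h100_usd:.2f}")
print(f"RTX 4090 on-demand: {rtx_min:.0f} min, ${rtx_usd:.2f}")
```

A preempted spot job costs extra wall-clock time rather than extra money for lost work, provided outputs are saved incrementally.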

Spot preemption strategy: Save partial outputs after every 100 images with a ComfyUI batch output checkpoint. For diffusers-based APIs, write output files to disk after each generation call rather than accumulating in memory. On preemption, restart and load from the last saved position. For overnight batches where completion time is flexible, spot preemption adds minimal overhead.
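
One way to make a batch resumable is to write each image to disk as soon as it is generated and skip existing files on restart. The sketch below illustrates that pattern only. `generate` is a hypothetical callable that stands in for a ComfyUI or diffusers call and returns encoded image bytes, and the file naming scheme is an assumption.

```python
from pathlib import Path


def run_resumable_batch(prompts, generate, out_dir):
    """Generate one file per prompt, skipping any already on disk.

    `generate` is a hypothetical callable(prompt) -> bytes; swap in your
    own pipeline call. Returns the number of images produced in this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    produced = 0
    for index, prompt in enumerate(prompts):
        target = out_dir / f"{index:06d}.png"
        if target.exists():
            continue  # finished before the last preemption
        data = generate(prompt)
        # Write to a temp file, then rename, so a preemption mid-write
        # never leaves a truncated file that looks complete.
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(target)
        produced += 1
    return produced
```

After a preemption, rerunning the same call with the same prompt list and output directory resumes at the first missing index. The output directory needs to be on storage that survives the instance.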

For the full setup walkthrough including Docker configuration and SSH tunnel access, see the ComfyUI on GPU cloud setup guide. For current rates across all GPU models, see GPU pricing on Spheron.


Running diffusion models at scale comes down to matching VRAM to your model precision, then finding the lowest $/hr for that VRAM tier. Rent an RTX 4090 for hobby Flux.1 and SDXL, L40S for high-res batch, or H100 for production studios and video diffusion.

Quick Setup Guide

  1. Pick your GPU tier by model and use case

    Use the TL;DR matrix in this guide to match your diffusion model (SD 1.5, SDXL, Flux.1, Flux.2, SD3.5) to a GPU tier. Hobby single-image: RTX 4090 on-demand. Production batch SDXL or Flux.1: H100 SXM spot or L40S on-demand. High-res or multi-model: L40S or H100 PCIe. Video diffusion or 4K Flux: H200 or RTX PRO 6000.

  2. Rent the GPU on Spheron

    Go to app.spheron.ai, select your GPU model, choose Ubuntu 22.04, and launch. For spot instances (H100 SXM), enable the spot toggle before deploying. Spot can be preempted but costs 60-80% less. Have your workflow or Docker setup ready to run immediately after the instance boots.

  3. Deploy ComfyUI or a diffusers API

    For interactive use: docker pull ghcr.io/ai-dock/comfyui:latest-cuda, run with --gpus all -p 127.0.0.1:8188:8188, and access via SSH tunnel. For production APIs: use the diffusers library with FastAPI, load the pipeline once at startup, and enable torch.compile for sustained throughput gains of 30-50%.

  4. Calculate your cost per image before committing

    Estimate images per hour using the benchmark tables in this guide. Divide the hourly GPU rate by images per hour to get cost per image. For overnight batch jobs, compare on-demand vs spot: spot pricing on H100 SXM saves ~79% for interruptible workloads.

Frequently Asked Questions

How much VRAM does SDXL need?

SDXL requires a minimum of 8GB VRAM at FP16 (1024x1024, base model only). 12GB is comfortable for the base model with a ControlNet, and the base plus refiner needs about 16GB at FP16 (about 12GB at FP8). 24GB (RTX 4090) gives headroom for ControlNet stacking and LoRA combinations. On cloud, an on-demand RTX 4090 at under $1/hr is the cost-efficient choice for SDXL batch runs.

Can the RTX 4090 run Flux?

Flux.1 Dev in FP8 quantization uses around 18-23GB of VRAM, which fits the RTX 4090's 24GB with little margin. BF16 needs 30-33GB and will not fit. Flux.1 Schnell in FP8 fits easily. For Flux.2 (32B), the RTX 4090 only runs Q4_K GGUF variants (~19GB); for full quality use an H100 PCIe or A100 80G with FP8.

How much does it cost to generate images on a cloud GPU?

For SDXL 1024x1024 at 20 steps: approximately $0.040 per 100 images on an RTX 4090 and $0.083 on an H100 PCIe at on-demand rates. Spot pricing on H100 SXM cuts that to roughly $0.021 per 100 images. For Flux.1 Dev FP8 at 28 steps: around $0.086 per 100 images on an RTX 4090.

Is the L40S worth it for image generation?

Yes, for high-resolution and batch generation. The L40S's 48GB GDDR6 fits Flux.1 Dev in BF16 (30-33GB), multiple ControlNets simultaneously, high-res upscaling pipelines, and batch queues without model reload. For single-image workflows, the RTX 4090 offers better cost efficiency. For overnight batch runs or 2K-4K output, the L40S is a better fit.

Which GPU do I need for Flux.2?

For Flux.2-dev (32B) in FP8 quantization, you need 32GB of VRAM for the backbone (with text encoder CPU-offloaded), which exceeds what an RTX 4090 (24GB) can handle. For reliable production runs without OOM risk, an H100 PCIe (80GB) or A100 80G is the safer choice. H100 SXM handles Flux.2 BF16 with text encoder CPU-offloaded. H200 (141GB) and RTX PRO 6000 (96GB) are suited for multi-model serving or 4K output pipelines.
