Tutorial

ComfyUI on GPU Cloud 2026: RTX 5090 vs H100 for Stable Diffusion & Flux

Written by Mitrasish, Co-founder · Mar 13, 2026
Tags: ComfyUI, Stable Diffusion, Flux.1, RTX 5090, H100, GPU Cloud, Image Generation, Diffusion Models

You want to run ComfyUI in the cloud to generate SDXL or Flux.1 images at scale, faster than your local GPU and without burning your own hardware 24/7. On Spheron, you can provision an RTX 5090 at $0.76/hr or an H100 at $2.01/hr on demand, run your generation jobs, and shut everything down when you're done.

The interesting comparison in 2026 isn't H100 vs A100. It's RTX 5090 vs H100 PCIe. These two GPUs have surprisingly similar memory bandwidth: GDDR7 at 1,792 GB/s vs HBM2e at 2,000 GB/s, an 11% gap. But the price difference is 2.6x. For diffusion model inference, which is largely memory-bandwidth-bound, that math matters a great deal.

Here's a benchmark of both GPUs in ComfyUI, a complete Docker setup guide, and a framework for deciding which GPU fits your workflow.

Why Cloud GPU for ComfyUI?

Consumer GPUs top out at 24GB VRAM (setting aside the 32GB RTX 5090 and the RTX PRO 6000). That's enough for SDXL and Flux.1 in FP8, but tight once you add ControlNets, run batch generations, or experiment with Flux.1 Dev in BF16. Most video generation models need 60-80GB and are completely out of reach on local hardware.

Speed is the second reason. A cloud H100 or RTX 5090 generates images significantly faster than a consumer RTX 3090 or 4080. If you're doing a batch run of 1,000 images for a project or testing 50 workflow variations, that throughput difference turns hours into minutes.

The third reason is flexibility. Your local machine is for other work. A cloud GPU instance runs your generation job overnight, you download the output in the morning, and you pay only for the time it was actually running. No idle hardware cost, no thermal throttling from a machine running hot for 8 hours, no interrupting your primary workstation.

Always-on workloads that used to require owning dedicated hardware (running a custom image generation API, generating training datasets, processing client projects) now cost exactly what you use and nothing more.

GPU Comparison for ComfyUI: RTX 5090 vs H100 vs RTX 4090

For image generation workloads, memory bandwidth is the primary bottleneck. Diffusion models at inference time repeatedly load large weight tensors and activations from memory; the GPU's compute units often sit waiting for data rather than computing. This is why the RTX 5090's GDDR7 bandwidth matters for ComfyUI.

All prices below are from Spheron's live marketplace as of March 13, 2026. Pricing fluctuates over time based on market supply and demand; check current GPU pricing for live rates before planning production workloads.

| GPU | VRAM | Memory Bandwidth | On-Demand $/hr | Spot $/hr | Best For |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | $0.76 | N/A | SDXL, Flux.1 Dev FP8, most workflows |
| H100 PCIe | 80GB HBM2e | 2,000 GB/s | $2.01 | N/A | Large models, video gen, heavy ControlNet stacking |
| H100 SXM | 80GB HBM3 | 3,350 GB/s | $2.50 | $0.80 | Maximum speed, video generation |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s | $0.58 | N/A | Budget option, fine for SDXL |

The RTX 5090 vs H100 PCIe nuance: The bandwidth gap is only ~11%. That translates to roughly 10-15% fewer images per minute on the RTX 5090 compared to H100 PCIe for SDXL and Flux.1. But the H100 PCIe costs $2.01/hr vs $0.76/hr, which is 2.6x more per hour.

Result: for typical SDXL and Flux.1 generation, the RTX 5090 delivers images at roughly 40% of the H100 PCIe's cost per image. Unless you need the H100's 80GB VRAM for video models or very large ControlNet stacks, the RTX 5090 wins on cost efficiency.

The H100 SXM is the speed champion with 3,350 GB/s HBM3 bandwidth, nearly double the RTX 5090's. At $0.80/hr Spot, it's also compelling for throughput-intensive workloads where you can tolerate potential preemption. On-demand at $2.50/hr, the math shifts back toward the RTX 5090 for most image generation use cases.

For detailed RTX 5090 specs and LLM inference benchmarks, see our complete RTX 5090 rental guide.

VRAM Requirements by Model

The 32GB VRAM on the RTX 5090 covers most ComfyUI workflows. Here's what actually fits:

| Model | VRAM Required | Fits on RTX 5090 (32GB)? | Notes |
| --- | --- | --- | --- |
| SDXL Base | 8-10GB | ✅ Yes | Very comfortable |
| SDXL + 3 ControlNets | 18-22GB | ✅ Yes | Fine |
| Flux.1 Schnell (FP8) | 12-16GB | ✅ Yes | Fast, good quality |
| Flux.1 Dev (FP8) | 18-23GB | ✅ Yes | Best quality FP8 |
| Flux.1 Dev (BF16) | 30-33GB | ⚠️ Marginal | Very tight; use FP8 instead |
| SDXL + 10 LoRAs | 20-26GB | ✅ Yes | Fine |
| AnimateDiff (short clips) | 8-16GB | ✅ Yes | SD1.5: ~8-10GB; SDXL: ~12-16GB |
| CogVideoX-5B | 24-32GB | ⚠️ Marginal | Close; use with tiled decode |
| Wan 2.1 video (5s) | ~60-70GB BF16 (FP8: ~40-50GB) | ❌ No | Needs H100; model offloading required |
| HunyuanVideo (720p) | 60GB min / 80GB rec. | ❌ No | 60GB minimum; 80GB recommended |

The main limitation of the RTX 5090 for ComfyUI is video generation: Wan 2.1 requires approximately 60-70GB VRAM (BF16, with model offloading) and HunyuanVideo requires at least 60GB (80GB recommended); neither will run on 32GB. For those models, H100 PCIe (80GB) is the minimum viable option.

For broader VRAM requirements across diffusion and LLM models, see our GPU requirements cheat sheet for 2026.

Benchmark Results: Images Per Minute

These benchmarks reflect ComfyUI community data for standard workflows. Test configuration: ComfyUI with xFormers enabled, batch size 1, Ubuntu 22.04, CUDA 12.4. Throughput numbers are approximate; results vary with driver version, CUDA version, and system configuration.

SDXL 1024×1024, 20 steps:

| GPU | Images/min | Time per image | On-Demand $/100 images | Spot $/100 images |
| --- | --- | --- | --- | --- |
| RTX 5090 | ~38 | ~1.6s | ~$0.033 | N/A |
| H100 PCIe | ~42 | ~1.4s | ~$0.080 | N/A |
| H100 SXM | ~60 | ~1.0s | ~$0.069 | ~$0.022 |
| RTX 4090 | ~28 | ~2.1s | ~$0.035 | N/A |

Flux.1 Dev 1024×1024, FP8, 20 steps:

| GPU | Images/min | Time per image | On-Demand $/100 images | Spot $/100 images |
| --- | --- | --- | --- | --- |
| RTX 5090 | ~23 | ~2.6s | ~$0.055 | N/A |
| H100 PCIe | ~27 | ~2.2s | ~$0.124 | N/A |
| H100 SXM | ~38 | ~1.6s | ~$0.110 | ~$0.035 |
| RTX 4090 | ~15 | ~4.0s | ~$0.064 | N/A |

For SDXL, the RTX 5090 ($0.033/100 images) and RTX 4090 ($0.035/100 images) are close in cost efficiency because the speed gain is roughly proportional to the price difference. The RTX 5090's advantages for SDXL are raw throughput (38 vs 28 img/min), which matters for time-sensitive workflows, and the headroom to run larger, more complex setups on its 32GB of VRAM.

For Flux.1 Dev FP8, the RTX 5090 becomes more compelling: $0.055 per 100 images vs $0.064 for the RTX 4090 (faster and cheaper) and $0.124 for the H100 PCIe (2.25x the cost per image). The H100 SXM on Spot at ~$0.035 offers the best price-for-speed when availability permits.
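
The cost columns in these tables come from one formula: dollars per 100 images = (100 / images per minute) / 60 × hourly rate. A quick sketch you can rerun with your own measured throughput; the numbers below are the RTX 5090 SDXL figures from the table above:

bash
# Cost per 100 images = (100 / images_per_min) / 60 * price_per_hr
# RTX 5090 SDXL numbers from the table above: ~38 img/min at $0.76/hr
awk 'BEGIN { ipm = 38; rate = 0.76; printf "$%.3f per 100 images\n", (100 / ipm) / 60 * rate }'
# => $0.033 per 100 images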

Key takeaway: for cost-conscious image generation teams running SDXL and Flux.1, the RTX 5090 offers strong value at on-demand pricing. H100 PCIe is the premium option when your workflow exceeds 32GB VRAM, not before.

Step-by-Step Setup: ComfyUI on Spheron

Step 1: Launch a GPU instance

On Spheron, go to the GPU catalog and select RTX 5090 or H100 PCIe. Choose Ubuntu 22.04 as your OS. Do not open port 8188 in the network settings. ComfyUI has no built-in authentication, so exposing it publicly would let anyone on the internet access your instance. You will access it securely via an SSH tunnel instead (Step 4).

Step 2: Pull and run the ComfyUI Docker image

SSH into your instance, then run:

bash
# Using latest-cuda pulls the most recent CUDA-enabled build of the ai-dock/comfyui image.
# This tag is a floating tag; the image author can push updates to it at any time.
# For a stronger supply-chain guarantee, pin by digest instead of tag:
#   docker pull ghcr.io/ai-dock/comfyui:latest-cuda
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# Then replace the IMAGE value below with the returned sha256 digest reference.
# Check https://github.com/ai-dock/comfyui/pkgs/container/comfyui for available tags.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE

The -v flags persist your model files and outputs across container restarts. --ipc=host gives the container the host's full shared memory space; without it, PyTorch's internal use of shared memory can hit Docker's small default /dev/shm allocation.
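
Before moving on, a quick sanity check that the container is healthy (assuming it was started with the command above and $IMAGE is still set in your shell):

bash
# Confirm the container is running
docker ps --filter "ancestor=$IMAGE"

# Watch startup logs until ComfyUI reports it is listening on port 8188
docker logs -f "$(docker ps -q --filter ancestor=$IMAGE)"

# From the instance itself, confirm the UI responds (it is bound to 127.0.0.1 only)
curl -sI http://127.0.0.1:8188 | head -n 1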

Step 3: Download models

bash
# Create model directories
mkdir -p ~/comfyui-models/checkpoints ~/comfyui-models/vae

# Download SDXL base checkpoint (~6.5GB)
wget -O ~/comfyui-models/checkpoints/sd_xl_base_1.0.safetensors \
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Download Flux.1 Dev FP8 (~11GB) from community repo (no auth required)
# The official black-forest-labs/FLUX.1-dev repo contains only the BF16 weights.
# FP8-quantized checkpoints are provided by the community via Kijai/flux-fp8.
pip install huggingface_hub
huggingface-cli download Kijai/flux-fp8 \
  --include "flux1-dev-fp8-e4m3fn.safetensors" \
  --local-dir ~/comfyui-models/checkpoints

For the BF16 variant (flux1-dev.safetensors), you need a HuggingFace account and must accept the model license at huggingface.co/black-forest-labs/FLUX.1-dev before downloading from the official repo.
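
If you do need BF16, the download looks like this once your account has accepted the license; you authenticate with a HuggingFace access token:

bash
# Log in with an access token from huggingface.co/settings/tokens
huggingface-cli login

# Download the official BF16 checkpoint (~23GB) from the gated repo
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors \
  --local-dir ~/comfyui-models/checkpoints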

Step 4: Access ComfyUI via SSH tunnel

Because ComfyUI has no built-in authentication, you access it through an SSH tunnel rather than opening the port directly to the internet. Run this on your local machine:

bash
ssh -L 8188:localhost:8188 user@your-server-ip

Replace user with your instance username and your-server-ip with the public IP shown in the Spheron dashboard. While the tunnel is open, navigate to http://localhost:8188 in your browser. The ComfyUI node graph interface will load. Traffic is encrypted in transit and the port is never exposed publicly.
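
For long overnight runs, you can keep the tunnel alive in the background instead of holding a terminal open:

bash
# -f backgrounds ssh after authentication; -N opens no remote shell
ssh -fN -L 8188:localhost:8188 user@your-server-ip

# Tear the tunnel down later by killing the forwarding process
pkill -f "ssh -fN -L 8188"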

Step 5: Load a workflow and generate

ComfyUI uses workflow JSON files. Load one by dragging a .json file onto the canvas, or use the Load button in the top menu. Community workflow sources include comfyworkflows.com and OpenArt.ai. The ai-dock container includes a default SDXL txt2img workflow to get you started immediately.
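
For scripted batch runs, ComfyUI also exposes a small HTTP API over the same tunnel. Export your workflow with Save (API Format) in ComfyUI (the API JSON differs from the canvas JSON), then queue it with a POST to the /prompt endpoint. A minimal sketch, assuming your export is saved as workflow_api.json:

bash
# Queue an API-format workflow for generation via ComfyUI's HTTP API
curl -X POST http://localhost:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat workflow_api.json)}"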

Optimizing ComfyUI Speed on Cloud GPU

Enable FP8 for Flux.1

Use the FP8 checkpoint loader node instead of the standard CheckpointLoaderSimple for Flux.1, pointing it at your flux1-dev-fp8-e4m3fn.safetensors file directly; no conversion is needed. FP8 reduces VRAM usage by ~40% and speeds up generation compared to BF16. Native FP8 Tensor Core support arrived with NVIDIA's 4th-generation Tensor Cores, present in both Hopper (H100) and Ada Lovelace (RTX 4090, L40S), and continues with Blackwell's 5th generation (RTX 5090), so every GPU in this comparison accelerates FP8 inference in hardware. The RTX 5090 additionally supports FP4 (Blackwell-only), but current ComfyUI and Flux.1 tooling does not yet take advantage of it.

Use TAESD for live previews

Tiny AutoEncoder for SD (TAESD) generates preview images in ComfyUI's node graph during generation without decoding the full VAE at each step. This significantly speeds up preview refresh and reduces VRAM pressure during long generations. Add a TAESDDecode node and connect it to the latent output of your sampler.

Set high VRAM mode

In ComfyUI settings (gear icon → top right), enable "Force High VRAM" mode. This keeps model weights loaded in VRAM between generation runs rather than unloading them. On RTX 5090 and H100 with dedicated VRAM, this eliminates per-run model loading overhead and meaningfully speeds up sequential generation.
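
If you run ComfyUI outside the ai-dock container, the equivalent is ComfyUI's --highvram launch flag (for the container itself, check the image docs for how it passes extra launch arguments):

bash
# Bare-metal / venv installs: keep models resident in VRAM between runs
python main.py --highvram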

Tile large images

For outputs above 1024×1024, use tiled VAE decode to avoid VRAM spikes during the final decode step. The VAEDecodeTiled node splits the latent into tiles, decodes them separately, and stitches the result. This enables 2048×2048 and larger outputs on 32GB VRAM without running out of memory.

Batch generation

Use the KSampler node's batch size parameter to generate multiple images per run. Batching is significantly more GPU-efficient than sequential single-image runs because it amortizes model loading and CUDA kernel launch overhead. On an RTX 5090, a batch of 4 images takes roughly 2.5x the time of a single image (not 4x), so throughput per hour increases substantially.
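
The throughput math, using the RTX 5090 SDXL numbers above and the ~2.5x batch-time figure (an approximation; measure your own workflow):

bash
# Single image ~1.6s vs a batch of 4 at ~2.5x that time
awk 'BEGIN { t1 = 1.6; t4 = 2.5 * t1; printf "single: %.0f img/min, batch of 4: %.0f img/min\n", 60 / t1, 4 * 60 / t4 }'
# => single: 38 img/min, batch of 4: 60 img/min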

Video Generation on Cloud GPU

Most video generation models in ComfyUI need 60GB+ VRAM at standard quality settings, well beyond the RTX 5090's 32GB. For video workflows, H100 PCIe or H100 SXM is the minimum practical option.

| Model | VRAM Required | Recommended GPU | Approx. Gen Time (5s, 720p) |
| --- | --- | --- | --- |
| AnimateDiff (16 frames) | 8-16GB | RTX 5090 | ~3-5 min |
| CogVideoX-5B | 24-32GB | RTX 5090 | ~5-8 min |
| Wan 2.1 (5 seconds) | ~60-70GB BF16 (FP8: ~40-50GB) | H100 PCIe | ~8-12 min |
| HunyuanVideo (720p) | 60GB min / 80GB rec. | H100 PCIe | ~20-30 min |

AnimateDiff and CogVideoX-5B are workable on the RTX 5090 for short clips. For Wan 2.1 or HunyuanVideo (the models generating the most visually impressive results for video in 2026), you need H100 PCIe's 80GB. Explore H100 GPU rental for video generation workloads.

Cost Comparison: Running ComfyUI on Spheron vs RunPod vs Local

Pricing as of March 13, 2026. GPU pricing fluctuates over time based on supply and availability; verify current rates before planning production workloads.

| Platform | GPU | On-Demand $/hr | 1,000 SDXL images (approx) |
| --- | --- | --- | --- |
| Spheron | RTX 5090 | $0.76 | ~$0.33 |
| Spheron | H100 PCIe | $2.01 | ~$0.80 |
| Spheron | H100 SXM (Spot) | $0.80 | ~$0.22 |
| RunPod | RTX 4090 | ~$0.74 | ~$0.44 |
| RunPod | H100 PCIe | ~$2.49 | ~$1.00 |
| Local | RTX 4090 | ~$0.05 (electricity only) | ~$0.03 |

The local machine cost assumes $0.12/kWh electricity with a 450W load and excludes hardware amortization. Factor in GPU purchase cost and it changes the math significantly for infrequent workloads.

For occasional large batch jobs (1,000-10,000 images for a project), cloud GPU is generally cheaper than owning hardware when you account for idle time. If you're generating images 8+ hours per day continuously, local hardware starts to compete on total cost.
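
A rough break-even sketch; the purchase price, lifespan, and utilization below are illustrative assumptions, not measured figures, so substitute your own:

bash
# Hypothetical: $1,900 RTX 4090 amortized over 3 years of 8-hour days,
# plus ~$0.05/hr electricity, vs. a $0.76/hr cloud RTX 5090
awk 'BEGIN {
  hw = 1900; hours = 3 * 365 * 8;   # assumed purchase price and usable hours
  local = hw / hours + 0.05;        # amortized $/hr plus electricity
  printf "local: $%.2f/hr vs cloud: $0.76/hr\n", local
}'
# => local: $0.27/hr at 8 hr/day; the same math at 1 hr/day gives ~$1.79/hr

Note this compares hourly rates only; the cloud RTX 5090 also generates roughly 36% more images per hour than a local RTX 4090.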

Spheron's RTX 5090 at $0.76/hr is priced competitively against RunPod's RTX 4090 at ~$0.74/hr, while delivering 36% faster generation and 8GB more VRAM. For teams currently on RunPod's RTX 4090 for image generation, Spheron's RTX 5090 gives you meaningfully better throughput at a comparable hourly rate.

For a full comparison between Spheron and RunPod across GPU options and features, see our Spheron vs RunPod comparison.


Run ComfyUI on Spheron's RTX 5090 or H100: deploy in minutes, no monthly subscriptions, shut down when your generation job is done.

Get an RTX 5090 for ComfyUI →
