Tutorial

ComfyUI on GPU Cloud 2026: RTX 5090 vs H100 for Stable Diffusion & Flux

Written by Mitrasish, Co-founder · Mar 13, 2026
Tags: ComfyUI, Stable Diffusion, Flux.1, RTX 5090, H100, GPU Cloud, Image Generation, Diffusion Models

You want to run ComfyUI in the cloud to generate SDXL or Flux.1 images at scale, faster than your local GPU and without burning your own hardware 24/7. On Spheron, you can provision an RTX 5090 at $0.76/hr or an H100 at $2.01/hr on demand, run your generation jobs, and shut everything down when you're done.

The interesting comparison in 2026 isn't H100 vs A100. It's RTX 5090 vs H100 PCIe. These two GPUs have surprisingly similar memory bandwidth: GDDR7 at 1,792 GB/s vs HBM2e at 2,000 GB/s, an 11% gap. But the price difference is 2.6x. For diffusion model inference, which is largely memory-bandwidth-bound, that math matters a great deal.

Here's a benchmark of both GPUs in ComfyUI, a complete Docker setup guide, and a framework for deciding which GPU fits your workflow.

Why Cloud GPU for ComfyUI?

Consumer GPUs top out at 24GB VRAM (setting aside the 32GB RTX 5090 and the RTX PRO 6000). That's enough for SDXL and Flux.1 in FP8, but tight once you add ControlNets, run batch generations, or experiment with Flux.1 Dev in BF16. Most video generation models need 60-80GB and are completely out of reach on local hardware.

Speed is the second reason. A cloud H100 or RTX 5090 generates images significantly faster than a consumer RTX 3090 or 4080. If you're doing a batch run of 1,000 images for a project or testing 50 workflow variations, that throughput difference turns hours into minutes.

The third reason is flexibility. Your local machine is for other work. A cloud GPU instance runs your generation job overnight, you download the output in the morning, and you pay only for the time it was actually running. No idle hardware cost, no thermal throttling from a machine running hot for 8 hours, no interrupting your primary workstation.

Always-on workloads that used to require owning dedicated hardware (running a custom image generation API, generating training datasets, processing client projects) now cost exactly what you use and nothing more.

GPU Comparison for ComfyUI: RTX 5090 vs H100 vs RTX 4090

For image generation workloads, memory bandwidth is the primary bottleneck. Diffusion models at inference time repeatedly load large weight tensors and activations from memory; the GPU's compute units often sit waiting for data rather than computing. This is why the RTX 5090's GDDR7 bandwidth matters for ComfyUI.

All prices below are from Spheron's live marketplace as of March 13, 2026. Pricing fluctuates over time based on market supply and demand; check current GPU pricing for live rates before planning production workloads.

| GPU | VRAM | Memory Bandwidth | On-Demand $/hr | Spot $/hr | Best For |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | $0.76 | N/A | SDXL, Flux.1 Dev FP8, most workflows |
| H100 PCIe | 80GB HBM2e | 2,000 GB/s | $2.01 | N/A | Large models, video gen, heavy ControlNet stacking |
| H100 SXM | 80GB HBM3 | 3,350 GB/s | $2.50 | $0.80 | Maximum speed, video generation |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s | $0.58 | N/A | Budget option, fine for SDXL |

The RTX 5090 vs H100 PCIe nuance: The bandwidth gap is only ~11%. That translates to roughly 10-15% fewer images per minute on the RTX 5090 compared to H100 PCIe for SDXL and Flux.1. But the H100 PCIe costs $2.01/hr vs $0.76/hr, which is 2.6x more per hour.

Result: for typical SDXL and Flux.1 generation, the RTX 5090 delivers images at roughly 40% of the H100 PCIe's cost per image. Unless you need the H100's 80GB VRAM for video models or very large ControlNet stacks, the RTX 5090 wins on cost efficiency.

The H100 SXM is the speed champion with 3,350 GB/s HBM3 bandwidth, nearly double the RTX 5090's. At $0.80/hr Spot, it's also compelling for throughput-intensive workloads where you can tolerate potential preemption. On-demand at $2.50/hr, the math shifts back toward the RTX 5090 for most image generation use cases.

For detailed RTX 5090 specs and LLM inference benchmarks, see our complete RTX 5090 rental guide.

VRAM Requirements by Model

The 32GB VRAM on the RTX 5090 covers most ComfyUI workflows. Here's what actually fits:

| Model | VRAM Required | Fits on RTX 5090 (32GB)? | Notes |
| --- | --- | --- | --- |
| SDXL Base | 8-10GB | ✅ Yes | Very comfortable |
| SDXL + 3 ControlNets | 18-22GB | ✅ Yes | Fine |
| Flux.1 Schnell (FP8) | 12-16GB | ✅ Yes | Fast, good quality |
| Flux.1 Dev (FP8) | 18-23GB | ✅ Yes | Best quality FP8 |
| Flux.1 Dev (BF16) | 30-33GB | ⚠️ Marginal | Very tight; use FP8 instead |
| SDXL + 10 LoRAs | 20-26GB | ✅ Yes | Fine |
| AnimateDiff (short clips) | 8-16GB | ✅ Yes | SD1.5: ~8-10GB; SDXL: ~12-16GB |
| CogVideoX-5B | 24-32GB | ⚠️ Marginal | Close; use with tiled decode |
| Wan 2.1 video (5s) | ~60-70GB BF16 (FP8: ~40-50GB) | ❌ No | Needs H100; model offloading required |
| HunyuanVideo (720p) | 60GB min / 80GB rec. | ❌ No | 60GB minimum; 80GB recommended |

The main limitation of the RTX 5090 for ComfyUI is video generation: Wan 2.1 requires approximately 60-70GB VRAM (BF16, with model offloading) and HunyuanVideo requires at least 60GB (80GB recommended); neither will run on 32GB. For those models, H100 PCIe (80GB) is the minimum viable option.

For broader VRAM requirements across diffusion and LLM models, see our GPU requirements cheat sheet for 2026.

Benchmark Results: Images Per Minute

These benchmarks reflect ComfyUI community data for standard workflows. Test configuration: ComfyUI with xFormers enabled, batch size 1, Ubuntu 22.04, CUDA 12.4. Throughput numbers are approximate; results vary with driver version, CUDA version, and system configuration.

SDXL 1024×1024, 20 steps:

| GPU | Images/min | Time per image | On-Demand $/100 images | Spot $/100 images |
| --- | --- | --- | --- | --- |
| RTX 5090 | ~38 | ~1.6s | ~$0.033 | N/A |
| H100 PCIe | ~42 | ~1.4s | ~$0.080 | N/A |
| H100 SXM | ~60 | ~1.0s | ~$0.069 | ~$0.022 |
| RTX 4090 | ~28 | ~2.1s | ~$0.035 | N/A |

Flux.1 Dev 1024×1024, FP8, 20 steps:

| GPU | Images/min | Time per image | On-Demand $/100 images | Spot $/100 images |
| --- | --- | --- | --- | --- |
| RTX 5090 | ~23 | ~2.6s | ~$0.055 | N/A |
| H100 PCIe | ~27 | ~2.2s | ~$0.124 | N/A |
| H100 SXM | ~38 | ~1.6s | ~$0.110 | ~$0.035 |
| RTX 4090 | ~15 | ~4.0s | ~$0.064 | N/A |

For SDXL, the RTX 5090 ($0.033/100 images) and RTX 4090 ($0.035/100 images) are close in cost efficiency because the speed gain is roughly proportional to the price difference. The RTX 5090's advantages for SDXL are raw throughput (38 vs 28 img/min), which matters for time-sensitive workflows, and the headroom to run larger, more complex setups on its 32GB of VRAM.

For Flux.1 Dev FP8, the RTX 5090 becomes more compelling: $0.055 per 100 images vs $0.064 for the RTX 4090 (faster and cheaper) and $0.124 for the H100 PCIe (2.25x the cost per image). The H100 SXM on Spot at ~$0.035 offers the best price-for-speed when availability permits.
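
The cost columns in these tables come from one formula: dollars per 100 images = (100 / images per minute) / 60 × hourly rate. A quick sketch you can rerun with your own measured throughput; the numbers below are the RTX 5090 SDXL figures from the table above:

bash
# Cost per 100 images = (100 / images_per_min) / 60 * price_per_hr
# RTX 5090 SDXL numbers from the table above: ~38 img/min at $0.76/hr
awk 'BEGIN { ipm = 38; rate = 0.76; printf "$%.3f per 100 images\n", (100 / ipm) / 60 * rate }'
# => $0.033 per 100 images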

Key takeaway: for cost-conscious image generation teams running SDXL and Flux.1, the RTX 5090 offers strong value at on-demand pricing. H100 PCIe is the premium option when your workflow exceeds 32GB VRAM, not before.

Step-by-Step Setup: ComfyUI on Spheron

Step 1: Launch a GPU instance

On Spheron, go to the GPU catalog and select RTX 5090 or H100 PCIe. Choose Ubuntu 22.04 as your OS. Do not open port 8188 in the network settings. ComfyUI has no built-in authentication, so exposing it publicly would let anyone on the internet access your instance. You will access it securely via an SSH tunnel instead (Step 4).

Step 2: Pull and run the ComfyUI Docker image

SSH into your instance, then run:

bash
# Using latest-cuda pulls the most recent CUDA-enabled build of the ai-dock/comfyui image.
# This tag is a floating tag; the image author can push updates to it at any time.
# For a stronger supply-chain guarantee, pin by digest instead of tag:
#   docker pull ghcr.io/ai-dock/comfyui:latest-cuda
#   docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/ai-dock/comfyui:latest-cuda
# Then replace the IMAGE value below with the returned sha256 digest reference.
# Check https://github.com/ai-dock/comfyui/pkgs/container/comfyui for available tags.
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE

docker run -d \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE

The -v flags persist your model files and outputs across container restarts. --ipc=host gives the container the host's full shared memory space; without it, PyTorch's internal use of shared memory can hit Docker's small default /dev/shm allocation.
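
Before moving on, a quick sanity check that the container is healthy (assuming it was started with the command above and $IMAGE is still set in your shell):

bash
# Confirm the container is running
docker ps --filter "ancestor=$IMAGE"

# Watch startup logs until ComfyUI reports it is listening on port 8188
docker logs -f "$(docker ps -q --filter ancestor=$IMAGE)"

# From the instance itself, confirm the UI responds (it is bound to 127.0.0.1 only)
curl -sI http://127.0.0.1:8188 | head -n 1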

Step 3: Download models

bash
# Create model directories
mkdir -p ~/comfyui-models/checkpoints ~/comfyui-models/vae

# Download SDXL base checkpoint (~6.5GB)
wget -O ~/comfyui-models/checkpoints/sd_xl_base_1.0.safetensors \
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Download Flux.1 Dev FP8 (~11GB) from community repo (no auth required)
# The official black-forest-labs/FLUX.1-dev repo contains only the BF16 weights.
# FP8-quantized checkpoints are provided by the community via Kijai/flux-fp8.
pip install huggingface_hub
huggingface-cli download Kijai/flux-fp8 \
  --include "flux1-dev-fp8-e4m3fn.safetensors" \
  --local-dir ~/comfyui-models/checkpoints

For the BF16 variant (flux1-dev.safetensors), you need a HuggingFace account and must accept the model license at huggingface.co/black-forest-labs/FLUX.1-dev before downloading from the official repo.
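
If you do need BF16, the download looks like this once your account has accepted the license; you authenticate with a HuggingFace access token:

bash
# Log in with an access token from huggingface.co/settings/tokens
huggingface-cli login

# Download the official BF16 checkpoint (~23GB) from the gated repo
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors \
  --local-dir ~/comfyui-models/checkpoints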

Step 4: Access ComfyUI via SSH tunnel

Because ComfyUI has no built-in authentication, you access it through an SSH tunnel rather than opening the port directly to the internet. Run this on your local machine:

bash
ssh -L 8188:localhost:8188 user@your-server-ip

Replace user with your instance username and your-server-ip with the public IP shown in the Spheron dashboard. While the tunnel is open, navigate to http://localhost:8188 in your browser. The ComfyUI node graph interface will load. Traffic is encrypted in transit and the port is never exposed publicly.
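
For long overnight runs, you can keep the tunnel alive in the background instead of holding a terminal open:

bash
# -f backgrounds ssh after authentication; -N opens no remote shell
ssh -fN -L 8188:localhost:8188 user@your-server-ip

# Tear the tunnel down later by killing the forwarding process
pkill -f "ssh -fN -L 8188"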

Step 5: Load a workflow and generate

ComfyUI uses workflow JSON files. Load one by dragging a .json file onto the canvas, or use the Load button in the top menu. Community workflow sources include comfyworkflows.com and OpenArt.ai. The ai-dock container includes a default SDXL txt2img workflow to get you started immediately.
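
For scripted batch runs, ComfyUI also exposes a small HTTP API over the same tunnel. Export your workflow with Save (API Format) in ComfyUI (the API JSON differs from the canvas JSON), then queue it with a POST to the /prompt endpoint. A minimal sketch, assuming your export is saved as workflow_api.json:

bash
# Queue an API-format workflow for generation via ComfyUI's HTTP API
curl -X POST http://localhost:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat workflow_api.json)}"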

Optimizing ComfyUI Speed on Cloud GPU

Enable FP8 for Flux.1

Use the FP8 checkpoint loader node instead of the standard CheckpointLoaderSimple for Flux.1, pointing it at your flux1-dev-fp8-e4m3fn.safetensors file directly; no conversion is needed. FP8 reduces VRAM usage by ~40% and speeds up generation compared to BF16. Native FP8 Tensor Core support arrived with NVIDIA's 4th-generation Tensor Cores, present in both Hopper (H100) and Ada Lovelace (RTX 4090, L40S), and continues with Blackwell's 5th generation (RTX 5090), so every GPU in this comparison accelerates FP8 inference in hardware. The RTX 5090 additionally supports FP4 (Blackwell-only), but current ComfyUI and Flux.1 tooling does not yet take advantage of it.

Use TAESD for live previews

Tiny AutoEncoder for SD (TAESD) generates preview images in ComfyUI's node graph during generation without decoding the full VAE at each step. This significantly speeds up preview refresh and reduces VRAM pressure during long generations. Add a TAESDDecode node and connect it to the latent output of your sampler.

Set high VRAM mode

In ComfyUI settings (gear icon → top right), enable "Force High VRAM" mode. This keeps model weights loaded in VRAM between generation runs rather than unloading them. On RTX 5090 and H100 with dedicated VRAM, this eliminates per-run model loading overhead and meaningfully speeds up sequential generation.
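
If you run ComfyUI outside the ai-dock container, the equivalent is ComfyUI's --highvram launch flag (for the container itself, check the image docs for how it passes extra launch arguments):

bash
# Bare-metal / venv installs: keep models resident in VRAM between runs
python main.py --highvram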

Tile large images

For outputs above 1024×1024, use tiled VAE decode to avoid VRAM spikes during the final decode step. The VAEDecodeTiled node splits the latent into tiles, decodes them separately, and stitches the result. This enables 2048×2048 and larger outputs on 32GB VRAM without running out of memory.

Batch generation

Use the KSampler node's batch size parameter to generate multiple images per run. Batching is significantly more GPU-efficient than sequential single-image runs because it amortizes model loading and CUDA kernel launch overhead. On an RTX 5090, a batch of 4 images takes roughly 2.5x the time of a single image (not 4x), so throughput per hour increases substantially.
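
The throughput math, using the RTX 5090 SDXL numbers above and the ~2.5x batch-time figure (an approximation; measure your own workflow):

bash
# Single image ~1.6s vs a batch of 4 at ~2.5x that time
awk 'BEGIN { t1 = 1.6; t4 = 2.5 * t1; printf "single: %.0f img/min, batch of 4: %.0f img/min\n", 60 / t1, 4 * 60 / t4 }'
# => single: 38 img/min, batch of 4: 60 img/min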

Video Generation on Cloud GPU

Most video generation models in ComfyUI need 60GB+ VRAM at standard quality settings, well beyond the RTX 5090's 32GB. For video workflows, H100 PCIe or H100 SXM is the minimum practical option.

| Model | VRAM Required | Recommended GPU | Approx. Gen Time (5s, 720p) |
| --- | --- | --- | --- |
| AnimateDiff (16 frames) | 8-16GB | RTX 5090 | ~3-5 min |
| CogVideoX-5B | 24-32GB | RTX 5090 | ~5-8 min |
| Wan 2.1 (5 seconds) | ~60-70GB BF16 (FP8: ~40-50GB) | H100 PCIe | ~8-12 min |
| HunyuanVideo (720p) | 60GB min / 80GB rec. | H100 PCIe | ~20-30 min |

AnimateDiff and CogVideoX-5B are workable on the RTX 5090 for short clips. For Wan 2.1 or HunyuanVideo (the models generating the most visually impressive results for video in 2026), you need H100 PCIe's 80GB. Explore H100 GPU rental for video generation workloads.

Cost Comparison: Running ComfyUI on Spheron vs RunPod vs Local

Pricing as of March 13, 2026. GPU pricing fluctuates over time based on supply and availability; verify current rates before planning production workloads.

| Platform | GPU | On-Demand $/hr | 1,000 SDXL images (approx) |
| --- | --- | --- | --- |
| Spheron | RTX 5090 | $0.76 | ~$0.33 |
| Spheron | H100 PCIe | $2.01 | ~$0.80 |
| Spheron | H100 SXM (Spot) | $0.80 | ~$0.22 |
| RunPod | RTX 4090 | ~$0.74 | ~$0.44 |
| RunPod | H100 PCIe | ~$2.49 | ~$1.00 |
| Local | RTX 4090 | ~$0.05 (electricity only) | ~$0.03 |

The local machine cost assumes $0.12/kWh electricity with a 450W load and excludes hardware amortization. Factor in GPU purchase cost and it changes the math significantly for infrequent workloads.

For occasional large batch jobs (1,000-10,000 images for a project), cloud GPU is generally cheaper than owning hardware when you account for idle time. If you're generating images 8+ hours per day continuously, local hardware starts to compete on total cost.
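
A rough break-even sketch; the purchase price, lifespan, and utilization below are illustrative assumptions, not measured figures, so substitute your own:

bash
# Hypothetical: $1,900 RTX 4090 amortized over 3 years of 8-hour days,
# plus ~$0.05/hr electricity, vs. a $0.76/hr cloud RTX 5090
awk 'BEGIN {
  hw = 1900; hours = 3 * 365 * 8;   # assumed purchase price and usable hours
  local = hw / hours + 0.05;        # amortized $/hr plus electricity
  printf "local: $%.2f/hr vs cloud: $0.76/hr\n", local
}'
# => local: $0.27/hr at 8 hr/day; the same math at 1 hr/day gives ~$1.79/hr

Note this compares hourly rates only; the cloud RTX 5090 also generates roughly 36% more images per hour than a local RTX 4090.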

Spheron's RTX 5090 at $0.76/hr is priced competitively against RunPod's RTX 4090 at ~$0.74/hr, while delivering 36% faster generation and 8GB more VRAM. For teams currently on RunPod's RTX 4090 for image generation, Spheron's RTX 5090 gives you meaningfully better throughput at a comparable hourly rate.

For a full comparison between Spheron and RunPod across GPU options and features, see our Spheron vs RunPod comparison.


Run ComfyUI on Spheron's RTX 5090 or H100: deploy in minutes, no monthly subscriptions, shut down when your generation job is done.

Get an RTX 5090 for ComfyUI →
