Tutorial

Deploy FLUX.2 on GPU Cloud: Production Image Generation Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 19, 2026
Tags: FLUX.2, FLUX.2 Deployment, Black Forest Labs, Image Generation, GPU Cloud, ComfyUI, Diffusers, FP8 Quantization, Text-to-Image

FLUX.2 from Black Forest Labs is the first production-grade diffusion transformer with reliable text rendering in generated images, and the question teams now face is which GPU to run it on and at what cost. This guide covers VRAM requirements, FP8 vs GGUF quantization trade-offs, Docker-based ComfyUI setup, diffusers-based API deployment, and a direct cost comparison against RunPod and Replicate for H100 and A100 configurations.

What is FLUX.2?

FLUX.2 is a 32 billion parameter rectified flow transformer from Black Forest Labs. Unlike Stable Diffusion's U-Net design, FLUX.2 uses a transformer backbone throughout. The practical difference is prompt comprehension and text rendering: FLUX.2-dev uses a Mistral Small 3.1 (24B) multimodal vision-language encoder, which gives it substantially better handling of complex, long prompts compared to earlier models, and the transformer architecture makes embedded text in generated images work reliably. If you've tried generating images with text in them using SDXL or earlier Flux, you know how broken it was.

Three variants are relevant for self-hosted inference. FLUX.2 Dev is the full 32B guidance-distilled model, targeting 20-50 inference steps and offering the highest output quality. FLUX.2-klein-9B and FLUX.2-klein-4B are step-distilled variants that run in 4 inference steps for sub-second generation. The klein models require far less VRAM (29GB and 13GB respectively) and suit latency-critical workloads or lower-budget GPU setups. For production image generation APIs where quality matters, Dev is the right choice. For high-throughput pipelines where cost or latency dominates, the klein models cut compute significantly.

FLUX.2 Dev is available under the FLUX Non-Commercial License, which requires accepting terms on HuggingFace. Commercial production use requires either the Pro API or a separate license from BFL. FLUX.2-klein-4B uses the Apache 2.0 license. FLUX.2-klein-9B uses the FLUX Non-Commercial License; verify terms on HuggingFace before commercial deployment.

For a comparison of H100, H200, and B200 for inference workloads broadly, see Best GPU for AI Inference in 2026.

VRAM Requirements and Quantization

VRAM is the first practical constraint. FLUX.2-dev is a 32B model, which makes it considerably heavier than FLUX.1. The model has two components that consume VRAM simultaneously: the transformer backbone and the text encoder stack. Both must fit in VRAM at the same time.
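The headline weight sizes follow from simple parameter-count arithmetic: bytes per parameter times parameter count. A quick sanity check (the helper function is ours, and these figures cover weights only, not the encoder or activations):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB: parameter count x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# FLUX.2-dev transformer: 32B parameters
print(weight_gb(32, 2))  # BF16 (2 bytes/param): 64.0 GB
print(weight_gb(32, 1))  # FP8  (1 byte/param):  32.0 GB
# The Mistral Small 3.1 (24B) text encoder adds its own footprint on top of this.
```

This is why BF16 is impractical on a single 80GB card once the encoder and activation memory are added, while FP8 leaves comfortable headroom.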

File sizes below are taken from the city96/FLUX.2-dev-gguf repository, which reflects actual disk (and VRAM) usage.

| Variant | Quantization | File Size / VRAM | Quality Loss | Minimum GPU |
|---|---|---|---|---|
| FLUX.2 Dev | BF16 | ~64GB | None (reference) | Needs >80GB; use FP8 instead |
| FLUX.2 Dev | FP8 | ~32GB | Minimal (perceptually transparent) | H100 PCIe (80GB), A100 80G |
| FLUX.2 Dev | GGUF Q8_0 | ~35GB | Minimal | H100 PCIe (80GB), A100 80G |
| FLUX.2 Dev | GGUF Q5_K_M | ~24GB | Low | A100 PCIe 80G |
| FLUX.2 Dev | GGUF Q4_K_S | ~19GB | Moderate (softened fine detail) | RTX 4090 (24GB) |
| FLUX.2 Dev | GGUF Q2_K | ~13GB | High | RTX 4090 (24GB) |
| FLUX.2-klein-9B | FP8/BF16 | ~29GB | N/A (distilled) | H100 PCIe, A100 80G |
| FLUX.2-klein-4B | FP8/BF16 | ~13GB | N/A (distilled) | RTX 4090 and above |

BF16 at 64GB is too large for any single 80GB GPU when you add text encoder overhead and activation memory. Use FP8 (~32GB) on H100 or A100 80G. RTX 4090 users should use GGUF Q4_K_S (~19GB) or FLUX.2-klein-4B (~13GB) rather than FP8 or Q8_0; neither fits in 24GB.

FP8 uses hardware Tensor Cores natively on Hopper (H100) and Ada Lovelace (RTX 4090). On Ampere (A100), FP8 is emulated in software and may be slower than BF16 for some workloads. If you're running on A100, test BF16 and FP8 side-by-side before committing to FP8 in production. The A100's 80GB VRAM makes FP8 and Q8_0 comfortable.
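One way to run that side-by-side test is a small timing harness. This is a sketch under our own naming: `bench` is a generic helper, and `pipe_bf16`/`pipe_fp8` stand for pipelines you have loaded yourself at each precision.

```python
import time

def bench(pipe_fn, warmup: int = 1, runs: int = 3) -> float:
    """Time a zero-arg generation callable; returns mean seconds per call.
    Warmup runs are excluded so one-time setup cost doesn't skew the mean."""
    for _ in range(warmup):
        pipe_fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        pipe_fn()
    return (time.perf_counter() - t0) / runs

# Usage sketch on an A100 (pipe_bf16 / pipe_fp8 are your own loaded pipelines):
# s_bf16 = bench(lambda: pipe_bf16(prompt="test", num_inference_steps=28))
# s_fp8  = bench(lambda: pipe_fp8(prompt="test", num_inference_steps=28))
# print(f"BF16 {s_bf16:.1f}s/image  FP8 {s_fp8:.1f}s/image")
```

If FP8 comes out slower on your A100, keep BF16 for speed and use FP8 only when you need the VRAM savings.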

GGUF quantization (via a diffusion-specific llama.cpp port) keeps the transformer on GPU but offloads the text encoder to CPU. This reduces peak VRAM but introduces CPU overhead for the text encoding step. For low-VRAM situations it's useful; for production throughput on H100 or A100, stick with FP8.

The practical recommendation: FP8 on H100 PCIe or A100 80G, GGUF Q4_K_S on RTX 4090, FLUX.2-klein-4B for sub-second latency use cases or GPUs below 20GB VRAM. See the RTX 4090 for AI/ML guide for more on what the 4090 can handle at different quantizations.
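The recommendation above can be sketched as a small decision helper. The function name and thresholds are ours, lifted from the table earlier in this section; the thresholds leave headroom for the text encoder and activations.

```python
def pick_flux2_variant(vram_gb: float) -> str:
    """Map available VRAM to the variant/quantization recommended above."""
    if vram_gb >= 80:
        return "FLUX.2-dev FP8 (~32GB)"       # H100 PCIe, A100 80G
    if vram_gb >= 24:
        return "FLUX.2-dev GGUF Q4_K_S (~19GB)"  # RTX 4090 class
    return "FLUX.2-klein-4B (~13GB)"          # sub-20GB GPUs or latency-critical

print(pick_flux2_variant(80))  # H100 PCIe / A100 80G
print(pick_flux2_variant(24))  # RTX 4090
print(pick_flux2_variant(16))  # smaller GPUs
```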

GPU Options for FLUX.2 on Spheron

Pricing below is from Spheron's live marketplace as of 19 Apr 2026. GPU pricing fluctuates based on availability. Check current GPU pricing → for live rates.

| GPU | VRAM | On-Demand $/hr | Spot $/hr | Best FLUX.2 Use Case |
|---|---|---|---|---|
| B200 SXM6 | 192GB HBM3e | N/A | $2.06 | Batch FP8, maximum throughput (spot only) |
| H100 SXM5 | 80GB HBM3 | $2.90 | N/A | FP8, high throughput |
| H100 PCIe | 80GB HBM2e | $2.01 | N/A | FP8 workhorse GPU |
| A100 SXM4 80G | 80GB HBM2 | $1.65 | $0.45 | FP8/BF16 budget option, spot ideal |
| A100 PCIe 80G | 80GB HBM2 | $1.04 | $1.14† | FP8/GGUF Q8_0, on-demand is cheapest option |
| RTX 4090 | 24GB GDDR6X | $0.79 | N/A | GGUF Q4_K or FLUX.2-klein, solo inference |

†A100 PCIe 80G spot is currently priced higher than on-demand ($1.14 vs $1.04/hr). For this GPU, on-demand is the cheaper choice. Spot pricing can shift, so check current pricing → before provisioning.

H100 PCIe at $2.01/hr is the workhorse for FP8 FLUX.2-dev. 80GB VRAM fits FP8 comfortably at 32GB. Note: BF16 (~64GB weights plus encoder) does not fit in 80GB in practice. A100 SXM4 80G at $1.65/hr on-demand and $0.45/hr spot is the value pick for teams that can tolerate spot preemption. FP8 fits comfortably on the A100's 80GB HBM2, and BF16 is possible with text encoder CPU offloading. See the A100 GPU rental guide for more deployment details.

RTX 4090 at $0.79/hr works for GGUF Q4_K_S (~19GB) or FLUX.2-klein-4B (~13GB). FP8 (~32GB) and Q8_0 (~35GB) both exceed its 24GB GDDR6X. B200 SXM6 at $2.06/hr spot fits large-batch or highest-throughput scenarios such as dataset generation pipelines. It is currently available spot-only. See the NVIDIA B200 complete guide for B200 specs and workload fit.

Rent an H100 → | View all GPU pricing →

Setup Guide: ComfyUI Deployment

ComfyUI on Docker is the fastest path to running FLUX.2 interactively. This covers FLUX.2 weights specifically, not FLUX.1.

Step 1: Launch a Spheron GPU instance. H100 PCIe or A100 80G is recommended for FP8 or GGUF Q8_0 (~32-35GB VRAM each). RTX 4090 works for GGUF Q4_K_S (~19GB). BF16 (~64GB) is not practical on any single 80GB GPU for FLUX.2-dev. Select Ubuntu 22.04 as the base image. For a full walkthrough of account setup and instance deployment, see the Spheron getting started guide.

Step 2: Pull and run the ComfyUI container:

```bash
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE
docker run -d \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

Binding to 127.0.0.1 keeps the port off the public interface. Do not bind to 0.0.0.0.

Step 3: Download FLUX.2 weights. FLUX.2 Dev requires a HuggingFace account and license acceptance before the download will succeed.

```bash
pip install huggingface_hub
huggingface-cli login   # paste your HF token; required for FLUX.2 Dev

# BF16 weights (~64GB); requires 80GB+ GPU, FP8 is recommended instead
huggingface-cli download black-forest-labs/FLUX.2-dev \
  --include "flux2-dev.safetensors" \
  --local-dir ~/comfyui-models/checkpoints

# GGUF Q8_0 weights (~35GB, requires H100 PCIe or A100 80G; no license gate)
huggingface-cli download city96/FLUX.2-dev-gguf \
  --include "flux2-dev-Q8_0.gguf" \
  --local-dir ~/comfyui-models/checkpoints

# GGUF Q4_K_S weights (~19GB, for RTX 4090 and lower-VRAM GPUs)
huggingface-cli download city96/FLUX.2-dev-gguf \
  --include "flux2-dev-Q4_K_S.gguf" \
  --local-dir ~/comfyui-models/checkpoints
```

Step 4: Connect via SSH tunnel:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

Then open http://localhost:8188 in your browser.

Step 5: Load a FLUX.2 workflow. Use UnetLoaderGGUF for GGUF variants or CheckpointLoaderSimple for safetensors. FLUX.2-dev uses a single Mistral-3 24B vision-language encoder (one encoder node in ComfyUI, not the dual T5-XXL + CLIP-L stack used by FLUX.1 and SDXL). FLUX.2-klein variants use a Qwen3 encoder instead. Check the model card at https://huggingface.co/black-forest-labs/FLUX.2-dev for the exact architecture before building custom workflows.

For a full introduction to ComfyUI setup on cloud GPU including SSH tunneling and security configuration, see ComfyUI on GPU Cloud 2026.

Setup Guide: diffusers Library for Production APIs

For teams building an image generation API rather than using a GUI, the diffusers library integrates cleanly with FastAPI and supports batching.

```python
import torch
from diffusers import Flux2Pipeline
from optimum.quanto import freeze, qfloat8, quantize

# Flux2Pipeline is the correct class for FLUX.2-dev (available in diffusers >= 0.33).
# For FLUX.2-klein, use Flux2KleinPipeline instead.
# FluxPipeline is FLUX.1 only and will not work with FLUX.2-dev weights.
#
# BF16 loads ~64GB of weights plus text encoder overhead, too large for a single 80GB GPU.
# Use FP8 quantization to bring VRAM usage to ~32GB, which fits H100 PCIe and A100 80G.
# Install: pip install optimum-quanto
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
pipe = pipe.to("cuda")

# Optional: torch.compile for sustained throughput
# First call takes 3-5 minutes to compile; subsequent calls are 30-50% faster
# Only worth using for sustained workloads, not single-shot generation
pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True,
)

image = pipe(
    prompt="A photograph of a coffee mug with the text 'Monday' printed on it",
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
    max_sequence_length=512,
).images[0]

image.save("output.png")
```

A few notes on the options:

FP8 loading: The example above uses optimum-quanto to quantize the transformer to FP8 after loading. This is the recommended path for H100 PCIe and A100 80G. On H100 and RTX 4090, hardware FP8 Tensor Cores give a real speedup. On A100, the speedup is smaller (FP8 is emulated in software on Ampere), but VRAM savings still apply. Alternatively, BitsAndBytesConfig(load_in_8bit=True) achieves a similar result.

Sequential CPU offload: pipe.enable_sequential_cpu_offload() allows FLUX.2 BF16 to run on 16GB VRAM by moving tensors to CPU between steps. Latency goes from seconds to minutes. Useful for experimentation, not production.
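For completeness, a minimal low-VRAM sketch of that offload path, assuming the same Flux2Pipeline class used in the main example above (not tested here; it requires downloading the gated weights):

```python
import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)
# Moves each submodule to GPU only while it runs, so peak VRAM drops to ~16GB.
# Every step pays PCIe transfer cost, which is why latency rises from seconds
# to minutes. Do NOT call .to("cuda") after enabling offload.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A coffee mug with the text 'Monday' printed on it",
    num_inference_steps=28,
).images[0]
```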

torch.compile warmup: The first inference call after enabling torch.compile takes 3-5 minutes while PyTorch traces and compiles the computation graph. After that, each subsequent call is 30-50% faster. If you restart the process, it compiles again. Only enable this for processes that will run many thousands of images.

Production Inference: Batching and Throughput

All values are approximate. Configuration: Ubuntu 22.04, CUDA 12.4, diffusers 0.33+, 1024x1024 output, 28 steps.

| GPU | Precision | Images/min (batch=1) | Images/min (batch=4) | $/100 images (on-demand) | $/100 images (spot) |
|---|---|---|---|---|---|
| H100 SXM5 | FP8 | ~18 | ~55 | ~$0.27 | N/A |
| H100 PCIe | FP8 | ~14 | ~42 | ~$0.24 | N/A |
| A100 SXM4 80G | FP8 | ~8 | ~22 | ~$0.34 | ~$0.09 |
| A100 SXM4 80G | BF16 (CPU offload) | ~7 | ~18 | ~$0.39 | ~$0.11 |
| A100 PCIe 80G | FP8 | ~8 | ~22 | ~$0.22 | ~$0.24† |
| RTX 4090 | GGUF Q4_K | ~4 | N/A | ~$0.33 | N/A |

†A100 PCIe 80G spot ($1.14/hr) is currently more expensive than on-demand ($1.04/hr), so spot costs more per image. On-demand is the cheaper option for this GPU until spot pricing falls below on-demand.

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

H100 PCIe BF16 is omitted: at 32B parameters, BF16 (~64GB) does not fit in 80GB alongside encoder overhead. Use FP8 (~32GB) on H100 PCIe. The A100 SXM4 80G table includes both precisions: FP8 at ~8 images/min (recommended) and BF16 with CPU text encoder offloading at ~7 images/min (slower due to CPU overhead, but avoids the quantization step). RTX 4090 uses GGUF Q4_K (~19GB) since FP8 and Q8_0 exceed its 24GB VRAM. RTX 4090 batch=4 is excluded due to limited headroom above the Q4_K model footprint.

FLUX.2 at batch=4 is not 4x slower than batch=1 because the attention layers parallelize well. Batch throughput gains are larger on H100 SXM (HBM3 bandwidth at 3,350 GB/s) than on H100 PCIe or A100. For production APIs expecting burst traffic, a batch queue using asyncio in FastAPI or a Celery worker pool significantly improves throughput compared to processing requests one at a time.
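The batch-queue pattern can be sketched with asyncio alone; the framework-agnostic core below is our own naming, not a library API. Requests land in a queue, a worker drains up to `MAX_BATCH` at a time, and each caller awaits its own future. In FastAPI, an endpoint would simply `await submit(prompt)`.

```python
import asyncio

MAX_BATCH = 4

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(generate_batch):
    """generate_batch(prompts) -> list of results, e.g. a batched pipe() call
    run in a thread so it doesn't block the event loop."""
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # Opportunistically drain whatever else is already queued, up to MAX_BATCH
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        prompts = [p for p, _ in batch]
        results = await asyncio.to_thread(generate_batch, prompts)
        for (_, f), r in zip(batch, results):
            f.set_result(r)

async def submit(prompt: str):
    """Enqueue one request and wait for its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut
```

Concurrent callers that arrive while a batch is generating get grouped into the next batch automatically, which is where the batch=4 throughput gains in the table come from.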

Scaling Patterns: On-Demand vs Spot vs Serverless

On-demand: H100 PCIe at $2.01/hr works well for steady workloads, say an internal tool generating 500-2,000 images per day. No preemption risk. Shut down overnight; restart in the morning and you pay for active hours only.

Spot: A100 SXM4 spot at $0.45/hr is roughly 78% cheaper than H100 PCIe on-demand. Spot instances can be preempted with short notice. Use for batch jobs (dataset generation, fine-tuning, large creative projects) where you can checkpoint progress and resume. Not suitable for latency-sensitive user-facing APIs. H100 PCIe does not currently have a spot tier on Spheron.

Burst scaling: For bursty traffic, a single always-on GPU sits idle most of the time. One pattern: run a small always-on instance (RTX 4090 at $0.79/hr) for baseline load, then spin up H100 instances on-demand during peak hours using the Spheron API. Alternatively, queue jobs asynchronously and let a fixed fleet work through them at its own pace.

Cost Comparison: Spheron vs RunPod vs Replicate

Per-image costs below are for FLUX.2 Dev FP8, 1024x1024, 28 steps, approximately 14 images/min on H100 PCIe and 8 images/min on A100 SXM4 (FP8). These match the FP8 rows in the throughput table above.

| Platform | GPU | $/hr | Images/min | $/1,000 images |
|---|---|---|---|---|
| Spheron | H100 PCIe (on-demand) | $2.01 | ~14 | ~$2.39 |
| Spheron | A100 SXM4 80G (spot) | $0.45 | ~8 | ~$0.94 |
| RunPod | H100 PCIe (on-demand) | higher rate | ~14 | higher |
| Replicate | FLUX.2 Dev (API) | per-image pricing | N/A | verify current Replicate rate |

RunPod H100 PCIe on-demand lists at a higher rate than Spheron's $2.01/hr; verify RunPod's current pricing page before making direct comparisons. Replicate per-image pricing varies by model version and step count; check Replicate's current pricing for FLUX.2 Dev before comparing.

A100 SXM4 spot at $0.45/hr cuts cost significantly for interruptible batch workloads. At 100,000 images per month: Spheron H100 PCIe on-demand runs roughly $239, Spheron A100 SXM4 spot roughly $94.
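The per-image math behind these figures is hourly rate divided by hourly throughput; a quick check that reproduces the table values (helper name is ours):

```python
def cost_per_1000(hourly_usd: float, images_per_min: float) -> float:
    """Cost to generate 1,000 images at a given hourly rate and throughput."""
    images_per_hour = images_per_min * 60
    return hourly_usd / images_per_hour * 1000

print(round(cost_per_1000(2.01, 14), 2))  # H100 PCIe on-demand: 2.39
print(round(cost_per_1000(0.45, 8), 2))   # A100 SXM4 spot:      0.94

# At 100,000 images/month, multiply the per-1,000 figure by 100:
print(round(cost_per_1000(2.01, 14) * 100))  # ~239
print(round(cost_per_1000(0.45, 8) * 100))   # ~94
```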

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For a broader cost and feature comparison, see Spheron vs RunPod: GPU Cloud Comparison.

Summary

For teams starting out with FLUX.2 in production: H100 PCIe FP8 on Spheron at $2.01/hr is the lowest-friction path. You get 80GB VRAM (FP8 at ~32GB fits comfortably), hardware FP8 Tensor Cores, and around 14 images/min at 28 steps. That covers most production image generation use cases without needing to think about quantization trade-offs.

For cost-focused teams that can tolerate spot preemption: A100 SXM4 80G spot at $0.45/hr running FP8 or BF16 is very competitive. The A100 is slower than the H100 for the same workload, but the cost per image wins clearly. At roughly $0.09/100 images on A100 spot vs $0.24/100 on H100 PCIe FP8 on-demand, A100 spot is the right choice for batch jobs.

For highest throughput: H100 SXM5 at $2.90/hr or B200 SXM6 at $2.06/hr spot depending on budget. The H100 SXM5 in FP8 reaches around 55 images/min at batch=4, making it the right choice for pipelines generating large datasets or processing thousands of images per hour. B200 SXM6 is currently spot-only.


FLUX.2 image generation workloads fit Spheron's on-demand GPU model well: spin up an H100 or A100 for a batch job, pay only for the hours you use, and skip the idle hardware cost. A100 SXM4 spot at $0.45/hr cuts costs further for workloads that can checkpoint.

Rent H100 PCIe → | Rent A100 → | View all GPU pricing →

Get started on Spheron →
