Tutorial

Deploy FLUX.2 on GPU Cloud: Production Image Generation Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 19, 2026
Tags: FLUX.2, FLUX.2 Deployment, Black Forest Labs, Image Generation, GPU Cloud, ComfyUI, Diffusers, FP8 Quantization, Text-to-Image

FLUX.2 from Black Forest Labs is the first production-grade diffusion transformer with reliable text rendering in generated images, and the question teams now face is which GPU to run it on and at what cost. This guide covers VRAM requirements, FP8 vs GGUF quantization trade-offs, Docker-based ComfyUI setup, diffusers-based API deployment, and a direct cost comparison against RunPod and Replicate for H100 and A100 configurations.

What is FLUX.2?

FLUX.2 is a 32 billion parameter rectified flow transformer from Black Forest Labs. Unlike Stable Diffusion's U-Net design, FLUX.2 uses a transformer backbone throughout. The practical difference is prompt comprehension and text rendering: FLUX.2-dev uses a Mistral Small 3.1 (24B) multimodal vision-language encoder, which gives it substantially better handling of complex, long prompts compared to earlier models, and the transformer architecture makes embedded text in generated images work reliably. If you've tried generating images with text in them using SDXL or earlier Flux, you know how broken it was.

Three variants are relevant for self-hosted inference. FLUX.2 Dev is the full 32B guidance-distilled model, targeting 20-50 inference steps and offering the highest output quality. FLUX.2-klein-9B and FLUX.2-klein-4B are step-distilled variants that run in 4 inference steps for sub-second generation. The klein models require far less VRAM (29GB and 13GB respectively) and suit latency-critical workloads or lower-budget GPU setups. For production image generation APIs where quality matters, Dev is the right choice. For high-throughput pipelines where cost or latency dominates, the klein models cut compute significantly.

FLUX.2 Dev is available under the FLUX Non-Commercial License, which requires accepting terms on HuggingFace. Commercial production use requires either the Pro API or a separate license from BFL. FLUX.2-klein-4B uses the Apache 2.0 license. FLUX.2-klein-9B uses the FLUX Non-Commercial License; verify terms on HuggingFace before commercial deployment.

For a comparison of H100, H200, and B200 for inference workloads broadly, see Best GPU for AI Inference in 2026.

VRAM Requirements and Quantization

VRAM is the first practical constraint. FLUX.2-dev is a 32B model, which makes it considerably heavier than FLUX.1. The model has two components that consume VRAM simultaneously: the transformer backbone and the text encoder stack. Both must fit in VRAM at the same time.
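The headline weight sizes follow from simple parameter-count arithmetic: bytes per parameter times parameter count. A quick sanity check (the helper function is ours, and these figures cover weights only, not the encoder or activations):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB: parameter count x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# FLUX.2-dev transformer: 32B parameters
print(weight_gb(32, 2))  # BF16 (2 bytes/param): 64.0 GB
print(weight_gb(32, 1))  # FP8  (1 byte/param):  32.0 GB
# The Mistral Small 3.1 (24B) text encoder adds its own footprint on top of this.
```

This is why BF16 is impractical on a single 80GB card once the encoder and activation memory are added, while FP8 leaves comfortable headroom.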

File sizes below are taken from the city96/FLUX.2-dev-gguf repository, which reflects actual disk (and VRAM) usage.

| Variant | Quantization | File Size / VRAM | Quality Loss | Minimum GPU |
|---|---|---|---|---|
| FLUX.2 Dev | BF16 | ~64GB | None (reference) | Needs >80GB; use FP8 instead |
| FLUX.2 Dev | FP8 | ~32GB | Minimal (perceptually transparent) | H100 PCIe (80GB), A100 80G |
| FLUX.2 Dev | GGUF Q8_0 | ~35GB | Minimal | H100 PCIe (80GB), A100 80G |
| FLUX.2 Dev | GGUF Q5_K_M | ~24GB | Low | A100 PCIe 80G |
| FLUX.2 Dev | GGUF Q4_K_S | ~19GB | Moderate (softened fine detail) | RTX 4090 (24GB) |
| FLUX.2 Dev | GGUF Q2_K | ~13GB | High | RTX 4090 (24GB) |
| FLUX.2-klein-9B | FP8/BF16 | ~29GB | N/A (distilled) | H100 PCIe, A100 80G |
| FLUX.2-klein-4B | FP8/BF16 | ~13GB | N/A (distilled) | RTX 4090 and above |

BF16 at 64GB is too large for any single 80GB GPU when you add text encoder overhead and activation memory. Use FP8 (~32GB) on H100 or A100 80G. RTX 4090 users should use GGUF Q4_K_S (~19GB) or FLUX.2-klein-4B (~13GB) rather than FP8 or Q8_0; neither fits in 24GB.

FP8 uses hardware Tensor Cores natively on Hopper (H100) and Ada Lovelace (RTX 4090). On Ampere (A100), FP8 is emulated in software and may be slower than BF16 for some workloads. If you're running on A100, test BF16 and FP8 side-by-side before committing to FP8 in production. The A100's 80GB VRAM makes FP8 and Q8_0 comfortable.
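One way to run that side-by-side test is a small timing harness. This is a sketch under our own naming: `bench` is a generic helper, and `pipe_bf16`/`pipe_fp8` stand for pipelines you have loaded yourself at each precision.

```python
import time

def bench(pipe_fn, warmup: int = 1, runs: int = 3) -> float:
    """Time a zero-arg generation callable; returns mean seconds per call.
    Warmup runs are excluded so one-time setup cost doesn't skew the mean."""
    for _ in range(warmup):
        pipe_fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        pipe_fn()
    return (time.perf_counter() - t0) / runs

# Usage sketch on an A100 (pipe_bf16 / pipe_fp8 are your own loaded pipelines):
# s_bf16 = bench(lambda: pipe_bf16(prompt="test", num_inference_steps=28))
# s_fp8  = bench(lambda: pipe_fp8(prompt="test", num_inference_steps=28))
# print(f"BF16 {s_bf16:.1f}s/image  FP8 {s_fp8:.1f}s/image")
```

If FP8 comes out slower on your A100, keep BF16 for speed and use FP8 only when you need the VRAM savings.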

GGUF quantization (via a diffusion-specific llama.cpp port) keeps the transformer on GPU but offloads the text encoder to CPU. This reduces peak VRAM but introduces CPU overhead for the text encoding step. For low-VRAM situations it's useful; for production throughput on H100 or A100, stick with FP8.

The practical recommendation: FP8 on H100 PCIe or A100 80G, GGUF Q4_K_S on RTX 4090, FLUX.2-klein-4B for sub-second latency use cases or GPUs below 20GB VRAM. See the RTX 4090 for AI/ML guide for more on what the 4090 can handle at different quantizations.
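The recommendation above can be sketched as a small decision helper. The function name and thresholds are ours, lifted from the table earlier in this section; the thresholds leave headroom for the text encoder and activations.

```python
def pick_flux2_variant(vram_gb: float) -> str:
    """Map available VRAM to the variant/quantization recommended above."""
    if vram_gb >= 80:
        return "FLUX.2-dev FP8 (~32GB)"       # H100 PCIe, A100 80G
    if vram_gb >= 24:
        return "FLUX.2-dev GGUF Q4_K_S (~19GB)"  # RTX 4090 class
    return "FLUX.2-klein-4B (~13GB)"          # sub-20GB GPUs or latency-critical

print(pick_flux2_variant(80))  # H100 PCIe / A100 80G
print(pick_flux2_variant(24))  # RTX 4090
print(pick_flux2_variant(16))  # smaller GPUs
```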

GPU Options for FLUX.2 on Spheron

Pricing below is from Spheron's live marketplace as of 19 Apr 2026. GPU pricing fluctuates based on availability. Check current GPU pricing → for live rates.

| GPU | VRAM | On-Demand $/hr | Spot $/hr | Best FLUX.2 Use Case |
|---|---|---|---|---|
| B200 SXM6 | 192GB HBM3e | N/A | $2.06 | Batch FP8, maximum throughput (spot only) |
| H100 SXM5 | 80GB HBM3 | $2.90 | N/A | FP8, high throughput |
| H100 PCIe | 80GB HBM2e | $2.01 | N/A | FP8 workhorse GPU |
| A100 SXM4 80G | 80GB HBM2 | $1.65 | $0.45 | FP8/BF16 budget option, spot ideal |
| A100 PCIe 80G | 80GB HBM2 | $1.04 | $1.14† | FP8/GGUF Q8_0, on-demand is cheapest option |
| RTX 4090 | 24GB GDDR6X | $0.79 | N/A | GGUF Q4_K or FLUX.2-klein, solo inference |

†A100 PCIe 80G spot is currently priced higher than on-demand ($1.14 vs $1.04/hr). For this GPU, on-demand is the cheaper choice. Spot pricing can shift, so check current pricing → before provisioning.

H100 PCIe at $2.01/hr is the workhorse for FP8 FLUX.2-dev. 80GB VRAM fits FP8 comfortably at 32GB. Note: BF16 (~64GB weights plus encoder) does not fit in 80GB in practice. A100 SXM4 80G at $1.65/hr on-demand and $0.45/hr spot is the value pick for teams that can tolerate spot preemption. FP8 fits comfortably on the A100's 80GB HBM2, and BF16 is possible with text encoder CPU offloading. See the A100 GPU rental guide for more deployment details.

RTX 4090 at $0.79/hr works for GGUF Q4_K_S (~19GB) or FLUX.2-klein-4B (~13GB). FP8 (~32GB) and Q8_0 (~35GB) both exceed its 24GB GDDR6X. B200 SXM6 at $2.06/hr spot fits large-batch or highest-throughput scenarios such as dataset generation pipelines. It is currently available spot-only. See the NVIDIA B200 complete guide for B200 specs and workload fit.

Rent an H100 → | View all GPU pricing →

Setup Guide: ComfyUI Deployment

ComfyUI on Docker is the fastest path to running FLUX.2 interactively. This covers FLUX.2 weights specifically, not FLUX.1.

Step 1: Launch a Spheron GPU instance. H100 PCIe or A100 80G is recommended for FP8 or GGUF Q8_0 (~32-35GB VRAM each). RTX 4090 works for GGUF Q4_K_S (~19GB). BF16 (~64GB) is not practical on any single 80GB GPU for FLUX.2-dev. Select Ubuntu 22.04 as the base image. For a full walkthrough of account setup and instance deployment, see the Spheron getting started guide.

Step 2: Pull and run the ComfyUI container:

```bash
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda

docker pull $IMAGE
docker run -d \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  $IMAGE
```

Binding to 127.0.0.1 keeps the port off the public interface. Do not bind to 0.0.0.0.

Step 3: Download FLUX.2 weights. FLUX.2 Dev requires a HuggingFace account and license acceptance before the download will succeed.

```bash
pip install huggingface_hub
huggingface-cli login   # paste your HF token; required for FLUX.2 Dev

# BF16 weights (~64GB); requires 80GB+ GPU, FP8 is recommended instead
huggingface-cli download black-forest-labs/FLUX.2-dev \
  --include "flux2-dev.safetensors" \
  --local-dir ~/comfyui-models/checkpoints

# GGUF Q8_0 weights (~35GB, requires H100 PCIe or A100 80G; no license gate)
huggingface-cli download city96/FLUX.2-dev-gguf \
  --include "flux2-dev-Q8_0.gguf" \
  --local-dir ~/comfyui-models/checkpoints

# GGUF Q4_K_S weights (~19GB, for RTX 4090 and lower-VRAM GPUs)
huggingface-cli download city96/FLUX.2-dev-gguf \
  --include "flux2-dev-Q4_K_S.gguf" \
  --local-dir ~/comfyui-models/checkpoints
```

Step 4: Connect via SSH tunnel:

```bash
ssh -L 8188:localhost:8188 user@your-server-ip
```

Then open http://localhost:8188 in your browser.

Step 5: Load a FLUX.2 workflow. Use UnetLoaderGGUF for GGUF variants or CheckpointLoaderSimple for safetensors. FLUX.2-dev uses a single Mistral-3 24B vision-language encoder (one encoder node in ComfyUI, not the dual T5-XXL + CLIP-L stack used by FLUX.1 and SDXL). FLUX.2-klein variants use a Qwen3 encoder instead. Check the model card at https://huggingface.co/black-forest-labs/FLUX.2-dev for the exact architecture before building custom workflows.

For a full introduction to ComfyUI setup on cloud GPU including SSH tunneling and security configuration, see ComfyUI on GPU Cloud 2026.

Setup Guide: diffusers Library for Production APIs

For teams building an image generation API rather than using a GUI, the diffusers library integrates cleanly with FastAPI and supports batching.

```python
import torch
from diffusers import Flux2Pipeline
from optimum.quanto import freeze, qfloat8, quantize

# Flux2Pipeline is the correct class for FLUX.2-dev (available in diffusers >= 0.33).
# For FLUX.2-klein, use Flux2KleinPipeline instead.
# FluxPipeline is FLUX.1 only and will not work with FLUX.2-dev weights.
#
# BF16 loads ~64GB of weights plus text encoder overhead, too large for a single 80GB GPU.
# Use FP8 quantization to bring VRAM usage to ~32GB, which fits H100 PCIe and A100 80G.
# Install: pip install optimum-quanto
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
pipe = pipe.to("cuda")

# Optional: torch.compile for sustained throughput
# First call takes 3-5 minutes to compile; subsequent calls are 30-50% faster
# Only worth using for sustained workloads, not single-shot generation
pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True,
)

image = pipe(
    prompt="A photograph of a coffee mug with the text 'Monday' printed on it",
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
    max_sequence_length=512,
).images[0]

image.save("output.png")
```

A few notes on the options:

FP8 loading: The example above uses optimum-quanto to quantize the transformer to FP8 after loading. This is the recommended path for H100 PCIe and A100 80G. On H100 and RTX 4090, hardware FP8 Tensor Cores give a real speedup. On A100, the speedup is smaller (FP8 is emulated in software on Ampere), but VRAM savings still apply. Alternatively, BitsAndBytesConfig(load_in_8bit=True) achieves a similar result.

Sequential CPU offload: pipe.enable_sequential_cpu_offload() allows FLUX.2 BF16 to run on 16GB VRAM by moving tensors to CPU between steps. Latency goes from seconds to minutes. Useful for experimentation, not production.
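For completeness, a minimal low-VRAM sketch of that offload path, assuming the same Flux2Pipeline class used in the main example above (not tested here; it requires downloading the gated weights):

```python
import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)
# Moves each submodule to GPU only while it runs, so peak VRAM drops to ~16GB.
# Every step pays PCIe transfer cost, which is why latency rises from seconds
# to minutes. Do NOT call .to("cuda") after enabling offload.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A coffee mug with the text 'Monday' printed on it",
    num_inference_steps=28,
).images[0]
```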

torch.compile warmup: The first inference call after enabling torch.compile takes 3-5 minutes while PyTorch traces and compiles the computation graph. After that, each subsequent call is 30-50% faster. If you restart the process, it compiles again. Only enable this for processes that will run many thousands of images.

Production Inference: Batching and Throughput

All values are approximate. Configuration: Ubuntu 22.04, CUDA 12.4, diffusers 0.33+, 1024x1024 output, 28 steps.

| GPU | Precision | Images/min (batch=1) | Images/min (batch=4) | $/100 images (on-demand) | $/100 images (spot) |
|---|---|---|---|---|---|
| H100 SXM5 | FP8 | ~18 | ~55 | ~$0.27 | N/A |
| H100 PCIe | FP8 | ~14 | ~42 | ~$0.24 | N/A |
| A100 SXM4 80G | FP8 | ~8 | ~22 | ~$0.34 | ~$0.09 |
| A100 SXM4 80G | BF16 (CPU offload) | ~7 | ~18 | ~$0.39 | ~$0.11 |
| A100 PCIe 80G | FP8 | ~8 | ~22 | ~$0.22 | ~$0.24† |
| RTX 4090 | GGUF Q4_K | ~4 | N/A | ~$0.33 | N/A |

†A100 PCIe 80G spot ($1.14/hr) is currently more expensive than on-demand ($1.04/hr), so spot costs more per image. On-demand is the cheaper option for this GPU until spot pricing falls below on-demand.

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

H100 PCIe BF16 is omitted: at 32B parameters, BF16 (~64GB) does not fit in 80GB alongside encoder overhead. Use FP8 (~32GB) on H100 PCIe. The A100 SXM4 80G table includes both precisions: FP8 at ~8 images/min (recommended) and BF16 with CPU text encoder offloading at ~7 images/min (slower due to CPU overhead, but avoids the quantization step). RTX 4090 uses GGUF Q4_K (~19GB) since FP8 and Q8_0 exceed its 24GB VRAM. RTX 4090 batch=4 is excluded due to limited headroom above the Q4_K model footprint.

FLUX.2 at batch=4 is not 4x slower than batch=1 because the attention layers parallelize well. Batch throughput gains are larger on H100 SXM (HBM3 bandwidth at 3,350 GB/s) than on H100 PCIe or A100. For production APIs expecting burst traffic, a batch queue using asyncio in FastAPI or a Celery worker pool significantly improves throughput compared to processing requests one at a time.
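The batch-queue pattern can be sketched with asyncio alone; the framework-agnostic core below is our own naming, not a library API. Requests land in a queue, a worker drains up to `MAX_BATCH` at a time, and each caller awaits its own future. In FastAPI, an endpoint would simply `await submit(prompt)`.

```python
import asyncio

MAX_BATCH = 4

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(generate_batch):
    """generate_batch(prompts) -> list of results, e.g. a batched pipe() call
    run in a thread so it doesn't block the event loop."""
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # Opportunistically drain whatever else is already queued, up to MAX_BATCH
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        prompts = [p for p, _ in batch]
        results = await asyncio.to_thread(generate_batch, prompts)
        for (_, f), r in zip(batch, results):
            f.set_result(r)

async def submit(prompt: str):
    """Enqueue one request and wait for its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut
```

Concurrent callers that arrive while a batch is generating get grouped into the next batch automatically, which is where the batch=4 throughput gains in the table come from.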

Scaling Patterns: On-Demand vs Spot vs Serverless

On-demand: H100 PCIe at $2.01/hr works well for steady workloads, say an internal tool generating 500-2,000 images per day. No preemption risk. Shut down overnight; restart in the morning and you pay for active hours only.

Spot: A100 SXM4 spot at $0.45/hr is roughly 78% cheaper than H100 PCIe on-demand. Spot instances can be preempted with short notice. Use for batch jobs (dataset generation, fine-tuning, large creative projects) where you can checkpoint progress and resume. Not suitable for latency-sensitive user-facing APIs. H100 PCIe does not currently have a spot tier on Spheron.

Burst scaling: For bursty traffic, a single always-on GPU sits idle most of the time. One pattern: run a small always-on instance (RTX 4090 at $0.79/hr) for baseline load, then spin up H100 instances on-demand during peak hours using the Spheron API. Alternatively, queue jobs asynchronously and let a fixed fleet work through them at its own pace.

Cost Comparison: Spheron vs RunPod vs Replicate

Per-image costs below are for FLUX.2 Dev FP8, 1024x1024, 28 steps, approximately 14 images/min on H100 PCIe and 8 images/min on A100 SXM4 (FP8). These match the FP8 rows in the throughput table above.

| Platform | GPU | $/hr | Images/min | $/1,000 images |
|---|---|---|---|---|
| Spheron | H100 PCIe (on-demand) | $2.01 | ~14 | ~$2.39 |
| Spheron | A100 SXM4 80G (spot) | $0.45 | ~8 | ~$0.94 |
| RunPod | H100 PCIe (on-demand) | higher rate | ~14 | higher |
| Replicate | FLUX.2 Dev (API) | per-image pricing | N/A | verify current Replicate rate |

RunPod H100 PCIe on-demand lists at a higher rate than Spheron's $2.01/hr; verify RunPod's current pricing page before making direct comparisons. Replicate per-image pricing varies by model version and step count; check Replicate's current pricing for FLUX.2 Dev before comparing.

A100 SXM4 spot at $0.45/hr cuts cost significantly for interruptible batch workloads. At 100,000 images per month: Spheron H100 PCIe on-demand runs roughly $239, Spheron A100 SXM4 spot roughly $94.
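The per-image math behind these figures is hourly rate divided by hourly throughput; a quick check that reproduces the table values (helper name is ours):

```python
def cost_per_1000(hourly_usd: float, images_per_min: float) -> float:
    """Cost to generate 1,000 images at a given hourly rate and throughput."""
    images_per_hour = images_per_min * 60
    return hourly_usd / images_per_hour * 1000

print(round(cost_per_1000(2.01, 14), 2))  # H100 PCIe on-demand: 2.39
print(round(cost_per_1000(0.45, 8), 2))   # A100 SXM4 spot:      0.94

# At 100,000 images/month, multiply the per-1,000 figure by 100:
print(round(cost_per_1000(2.01, 14) * 100))  # ~239
print(round(cost_per_1000(0.45, 8) * 100))   # ~94
```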

Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

For a broader cost and feature comparison, see Spheron vs RunPod: GPU Cloud Comparison.

Summary

For teams starting out with FLUX.2 in production: H100 PCIe FP8 on Spheron at $2.01/hr is the lowest-friction path. You get 80GB VRAM (FP8 at ~32GB fits comfortably), hardware FP8 Tensor Cores, and around 14 images/min at 28 steps. That covers most production image generation use cases without needing to think about quantization trade-offs.

For cost-focused teams that can tolerate spot preemption: A100 SXM4 80G spot at $0.45/hr running FP8 or BF16 is very competitive. The A100 is slower than the H100 for the same workload, but the cost per image wins clearly. At roughly $0.09/100 images on A100 spot vs $0.24/100 on H100 PCIe FP8 on-demand, A100 spot is the right choice for batch jobs.

For highest throughput: H100 SXM5 at $2.90/hr or B200 SXM6 at $2.06/hr spot depending on budget. The H100 SXM5 in FP8 reaches around 55 images/min at batch=4, making it the right choice for pipelines generating large datasets or processing thousands of images per hour. B200 SXM6 is currently spot-only.


FLUX.2 image generation workloads fit Spheron's on-demand GPU model well: spin up an H100 or A100 for a batch job, pay only for the hours you use, and skip the idle hardware cost. A100 SXM4 spot at $0.45/hr cuts costs further for workloads that can checkpoint.

Rent H100 PCIe → | Rent A100 → | View all GPU pricing →

Get started on Spheron →
