Fal.ai bills per output. For FLUX.2-dev at 1024x1024 with 28 steps, that works out to roughly $0.012 per image based on their published $0.012/megapixel rate. Run 100 images and you pay around $1.20. Run the same 100 images on a Spheron H100 PCIe at $2.01/hr: the H100 generates approximately 7 images/min at FP8 (FLUX.2-dev is a 32B model; community FP8 benchmarks on H100 PCIe are typically in the 6-8 img/min range), so 100 images take about 14 minutes and cost roughly $0.47. That is a ~2.5x cost difference. The ratio holds at any volume, so the absolute gap widens as volume increases.
The video picture is different. Fal.ai charges per second of video output for Wan 2.5. At their published rate of $0.05/sec, a 5-second 720p clip runs approximately $0.25. Self-hosted Wan 2.2 14B on an H100 PCIe takes about 11 minutes per clip and costs approximately $0.37. At 50 clips per day, Fal.ai is $12.50 vs Spheron at $18.50. Self-hosting for video is driven by model access, not raw clip cost: custom Wan configurations, resolution overrides, and sampler adjustments that Fal.ai's API does not expose. For a deeper look at Wan 2.5 GPU setup and benchmark data, see Deploy Wan 2.5 on GPU Cloud.
Beyond cost, Fal.ai has a model access constraint. Their API exposes parameters like prompt, steps, guidance scale, and a single LoRA URL. Teams that need custom LoRA chains, non-default samplers, or checkpoint-level access cannot get that through Fal.ai's API surface. You get what they expose. This guide covers 10 alternatives, with specific pricing and tradeoffs for each.
Why Teams Look Beyond Fal.ai
Per-output pricing math
Serverless per-output billing is efficient at low volumes. At scale, the economics invert. Here is how daily cost compares between Fal.ai and a self-hosted H100 PCIe across generation volumes:
| Images per day | Fal.ai (~$0.012/image est.) | Spheron H100 PCIe ($2.01/hr, ~7 img/min est.) | Better value |
|---|---|---|---|
| 50 images/day | ~$0.60 | ~$0.24 | Spheron (cost), Fal.ai (zero ops) |
| 200 images/day | ~$2.40 | ~$0.96 | Spheron (~2.5x cheaper) |
| 1,000 images/day | ~$12 | ~$4.79 | Spheron |
| 10,000 images/day | ~$120 | ~$47.90 | Spheron |
| 50,000 images/day | ~$600 | ~$239.30 | Spheron |
Fal.ai pricing estimated from their public pricing docs as of 05 May 2026. Spheron pricing live-fetched 05 May 2026. Check current GPU pricing for live Spheron rates.
The crossover is not the relevant question. Even at 50 images/day, the Spheron H100 is ~2.5x cheaper on a compute basis. The real question is whether zero infrastructure overhead is worth paying ~2.5x more for image generation. For teams that need true zero ops at low volumes, that tradeoff is defensible. For anything above ~200 images/day, it is not.
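The arithmetic behind the table can be reproduced with a short sketch. It assumes the figures quoted above ($0.012 per image on Fal.ai, $2.01/hr for the H100 PCIe, ~7 images/min at FP8) and billing on active GPU time only; the function names are illustrative, not part of any provider SDK.

```python
FAL_PRICE_PER_IMAGE = 0.012  # USD, estimate quoted above
GPU_PRICE_PER_HOUR = 2.01    # USD, H100 PCIe rate quoted above
IMAGES_PER_MINUTE = 7.0      # estimated FP8 throughput quoted above


def per_output_cost(images, price_per_image=FAL_PRICE_PER_IMAGE):
    """Cost under per-output billing: linear in the number of images."""
    return images * price_per_image


def per_hour_cost(images, gpu_price_per_hour=GPU_PRICE_PER_HOUR,
                  images_per_minute=IMAGES_PER_MINUTE):
    """Cost under per-hour billing, counting only active GPU minutes."""
    gpu_minutes = images / images_per_minute
    return gpu_minutes * gpu_price_per_hour / 60.0


if __name__ == "__main__":
    for n in (50, 200, 1_000, 10_000, 50_000):
        api = per_output_cost(n)
        gpu = per_hour_cost(n)
        print(f"{n:>6} images/day: API ${api:,.2f} vs GPU ${gpu:,.2f} ({api / gpu:.1f}x)")
```

Because both pricing models are linear in volume, the ratio stays near 2.5x on every row; only the absolute gap grows. Small differences from the table (for example about $47.86 here vs ~$47.90 at 10,000 images) come from rounding the per-image cost before multiplying.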
Queue latency and cold starts
Fal.ai queues requests when capacity is constrained. For a popular model like FLUX.2-dev, queue wait times are usually short. For less-popular models or custom endpoints, cold-start overhead of 15-45 seconds is common. On a self-hosted instance, your model stays resident in VRAM between requests. There is no queue and no cold start: the first request after startup gets the same latency as the hundredth.
For user-facing products where someone submitted a prompt and is waiting, even a 15-second queue is noticeable. For batch generation pipelines, queue variability makes throughput planning difficult.
Model and parameter lock-in
Fal.ai's API surface defines what you can do. You can pass a prompt, select from their supported models, set standard inference parameters, and attach one LoRA via URL. Multi-LoRA chains, custom schedulers, custom VAE swaps, negative embedding files, and checkpoint blending are not API-accessible concepts. Teams building complex generation pipelines with fine-grained control over the diffusion process need direct access to the model stack, which only self-hosting provides.
Fal.ai at a Glance
Fal.ai is a serverless inference platform focused on image and video generation:
- Supported models: FLUX.1-dev, FLUX.1-schnell, FLUX.2 variants, Stable Diffusion 3.5, Wan 2.x, HunyuanVideo, LTX-Video, ControlNet adapters, various fine-tuned variants
- Pricing model: Per-image (resolution + steps-based for image models) and per-second of output video for video models
- Strengths: Zero infrastructure, rapid time-to-first-generation for burst use cases, large model catalog, no container management
- Constraints: No custom checkpoint uploads, single LoRA only, API-bounded inference parameters, per-output costs that scale linearly with volume
To be fair, Fal.ai is genuinely good at what it is optimized for: low-volume, zero-ops generation where time-to-first-image matters more than cost per image.
Evaluation Criteria
Five criteria used to rank these alternatives:
- Bare-metal vs. serverless control: dedicated GPU with root access vs. managed API. Determines hardware control, cold starts, and pricing model.
- Supported model families: FLUX, SD3, Wan, HunyuanVideo, LTX-Video, custom LoRA. Coverage and recency matter.
- Custom LoRA and sampler support: whether you can load arbitrary checkpoints, chain multiple LoRAs, or use custom samplers and schedulers.
- Per-output vs. per-hour pricing model: how predictable your monthly bill is as generation volume scales.
- Cold-start latency: time from idle to first generated output. Critical for user-facing products and low-traffic endpoints.
Quick Comparison: Fal.ai vs 10 Alternatives
| Provider | Pricing Model | FLUX.2 | Wan 2.5 | Custom LoRA | Cold Starts | Best For |
|---|---|---|---|---|---|---|
| Fal.ai (baseline) | Per-generation | Yes | Yes | Single LoRA (URL) | Occasional | Low-volume, zero-ops image/video |
| Spheron | Per-minute | Yes (self-hosted) | Yes (self-hosted) | Full checkpoint access | None | Sustained generation, custom workflows |
| Replicate | Per-second GPU | Yes | Yes | Replicate LoRA API | Yes | Community model hosting, prototyping |
| RunPod Serverless | Per-second | Yes (custom container) | Yes | Custom containers | Yes | Mixed bursty + dedicated workloads |
| Modal | Per-second | Yes (Python-native) | Yes | Custom containers | Yes | Python-native burst inference |
| Together AI | Per-token / per-hr | LLM-focused | No | Fine-tuned endpoints | Yes (serverless) | Open-weight LLM catalog |
| Baseten | Per-call | Yes (Truss) | Partial | Truss framework | Optional dedicated | Enterprise model APIs, SLA contracts |
| BentoCloud | Per-second | Yes | Partial | Custom Bentos | Yes | Python ML serving, AutoML pipelines |
| Self-host (ComfyUI) | Infra cost only | Yes (full) | Yes (full) | Full checkpoint access | None | Maximum control, complex workflows |
| HuggingFace Endpoints | Per-hour (dedicated) | Yes (Hub models) | Yes (Hub models) | HF-compatible | Configurable | HuggingFace Hub model hosting |
| Fireworks AI | Per-token | LLM-focused | No | Fine-tuned endpoints | Yes | Low-latency LLM serverless |
GPU rates fetched 05 May 2026. Third-party rates estimated from public pricing pages as of 05 May 2026.
Now let's break down each one.
1. Spheron: Bare-Metal GPU, Full Checkpoint Access, Per-Minute Billing
H100 PCIe: $2.01/hr | H100 SXM5: $4.41/hr | B200: on-demand and spot (current rates) | Per-minute billing | No contracts
Pricing fetched 05 May 2026. Rates fluctuate with GPU availability.
Spheron is the most direct cost alternative for teams that have grown past Fal.ai's per-output pricing or need access to the model stack that a managed API cannot provide. The difference from Fal.ai is fundamental: you get a dedicated bare-metal GPU, root SSH access, and no layer between your code and the GPU.
The cost math is direct. H100 on Spheron starts at $2.01/hr for the PCIe 80GB. FLUX.2-dev at FP8 generates approximately 7 images per minute (28 steps, 1024x1024; FLUX.2-dev is a 32B model and community FP8 benchmarks on H100 PCIe typically land in the 6-8 img/min range). At that throughput, 1,000 images costs about 143 minutes of GPU time: $4.79 total. The same 1,000 images via Fal.ai at ~$0.012/image is $12. At 10,000 images the gap is $47.90 vs $120.
For Wan 2.5 video generation, the B200's 192GB VRAM eliminates the tight VRAM margins that H100 PCIe faces at 720p FP16. B200 instances on Spheron are available for high-volume video workflows. For FLUX.2 deployment specifics and container setup, see Deploy FLUX.2 on GPU Cloud. For ComfyUI workflow setup and the WanVideoWrapper node, see ComfyUI on GPU Cloud 2026.
What Spheron does well
- Per-minute billing with no minimum commitment
- H100, H200, A100, B200, L40S, and RTX-series on demand
- Full bare-metal access: load any checkpoint, any LoRA stack, any sampler
- No proprietary container format: run ComfyUI, diffusers, custom inference servers
- Spot instances available across GPU families
- Multi-GPU clusters with InfiniBand for distributed inference
Where it falls short
- No serverless or scale-to-zero: you provision and manage instances
- No hosted model catalog: you bring your own weights
- Monitoring, health checks, and scaling are your responsibility
Best for
Teams running sustained FLUX.2 or video generation workflows above ~200 images/day or ~30 clips/day, and anyone who needs custom LoRA stacking, checkpoint-level access, or full control over inference parameters.
2. Replicate: Per-Second Billing on a Large Model Catalog
Effective H100 rate: ~$5.49/hr | Per-second GPU billing | Scale-to-zero | FLUX.2, Wan support
Replicate rates based on their published per-second billing as of May 2026.
Replicate's serverless model is built around its public model registry. Thousands of community models are hosted and accessible via API call with no deployment work. FLUX.2 variants, Stable Diffusion, and Wan video models are available. You pay per second of GPU compute, and the platform scales to zero between requests.
Compared to Fal.ai, Replicate has a larger public model catalog and supports a wider range of non-image-generation models. The per-second billing works out to approximately $5.49/hr equivalent for H100 compute, which is more expensive than Fal.ai for image generation at similar throughput but gives you access to more model types. For a detailed Replicate comparison including the Cog migration path, see Replicate Alternatives.
What Replicate does well
- Largest public model catalog in this list
- Per-second billing is efficient for low-frequency inference
- No infrastructure work for community models
- Supports LLM inference, image gen, audio, and video under one API
Where it falls short
- Among the highest effective H100 hourly rates in this list (~$5.49/hr)
- Cog format lock-in for custom model deployments
- Cold starts on low-traffic models (30-120 seconds for large models)
Best for
Teams that primarily need access to community models at low volume and want the broadest catalog without deployment work.
3. RunPod: Serverless and On-Demand Under One Account
H100 SXM: ~$2.69/hr | Serverless per-second billing | Custom containers | Community GPU marketplace
RunPod rates from their deploy console, May 2026.
RunPod covers both dedicated GPU instances (RunPod On-Demand) and serverless endpoints (RunPod Serverless). If your team has a mix of bursty generation jobs and sustained workloads, RunPod handles both under one account. Custom Docker containers are supported on both tiers, so you can bring your own ComfyUI or diffusers setup rather than using a managed model endpoint.
For image generation, RunPod Serverless with a custom ComfyUI container gives you more control over the inference stack than Fal.ai while keeping per-request billing for bursty workloads. The on-demand H100 SXM rate of ~$2.69/hr is above Spheron's $2.01/hr H100 PCIe rate (though below Spheron's $4.41/hr SXM5 rate), and RunPod has a strong community template library.
What RunPod does well
- Serverless and on-demand in one platform with full custom container support
- Community template library for common image and video generation setups
- GPU marketplace with occasional low-cost community GPU instances
- Good documentation for ComfyUI and diffusers deployments
Where it falls short
- On-demand pricing slightly above Spheron for pure sustained inference
- Serverless cold starts on large model containers
- Marketplace GPU quality varies across provider tiers
Best for
Teams that need both serverless burst capacity and dedicated instances, especially those wanting custom containers without committing to a pure bare-metal setup.
4. Modal: Python-Native Serverless with Per-Second Billing
H100 effective rate: ~$3.95/hr | Scale-to-zero | Per-second billing | Python decorator workflow
Modal's serverless model is built around Python decorators. Add @app.function(gpu="H100") to your inference function and Modal handles scheduling, scaling, and container builds. For teams coming from Fal.ai who want more control over the Python-level inference logic without managing GPU infrastructure, Modal is a natural fit.
The tradeoff: Modal's SDK is deeply integrated into your code. Functions decorated with Modal primitives do not run outside Modal's runtime, which is a form of lock-in that mirrors Fal.ai's API surface lock-in. Effective H100 rate of ~$3.95/hr is below Fal.ai's equivalent but above bare-metal.
What Modal does well
- Python-native deployment with minimal operational overhead
- Auto-scaling to zero eliminates idle costs for bursty workloads
- GPU memory snapshots reduce cold start times on qualifying containers
- Custom Python inference code without needing a container management layer
Where it falls short
- SDK lock-in: Modal-decorated functions require Modal's runtime
- Higher effective GPU rate than bare-metal alternatives
- Cold starts still occur for large model deployments without snapshot optimization
Best for
Python-native teams running burst image generation where idle periods are long and per-second billing is more economical than reserving hourly capacity.
5. Together AI: Open-Weight LLM Catalog with GPU Clusters
Llama 3.3 70B: $0.88/1M tokens | H100 Instant Clusters: $3.49/hr | OpenAI-compatible API
Together AI is primarily an LLM inference platform. For teams coming from Fal.ai who need image generation alternatives, Together AI is not the right fit: they do not have FLUX.2, Wan video, or SD3 in their model catalog. Where they excel is open-weight LLM access with an OpenAI-compatible API and competitive per-token pricing on popular models like Llama, Qwen, and DeepSeek.
If you are using Fal.ai for both image generation and occasional LLM inference, Together AI covers the LLM workloads at better rates while you migrate image workloads to a dedicated provider.
What Together AI does well
- Broad open-weight LLM catalog, often among the first to add new model releases
- Competitive per-token rates with OpenAI-compatible endpoints
- Dedicated Instant GPU Clusters at $3.49/hr for guaranteed capacity
- Fine-tuned model hosting with per-token billing
Where it falls short
- No support for image generation, video generation, or diffusion models
- Per-token costs at high LLM volumes exceed dedicated GPU rates
- Not suited for ComfyUI workflows or diffusion inference
Best for
Teams migrating off Fal.ai for LLM workloads specifically; not a substitute for image or video generation.
6. Baseten: Production Model Serving with Enterprise SLAs
H100: ~$6.50/hr effective | Truss deployment framework | Private VPCs | SLA contracts
Baseten targets production model APIs at enterprise scale. Their Truss framework handles container build and scaling; you define the model and dependencies, and Baseten manages the rest. They support image generation models including FLUX variants through Truss deployments, with optional dedicated GPU endpoints for latency-sensitive workloads.
At ~$6.50/hr effective H100 rate, Baseten is the most expensive option in this list. The premium covers production tooling: SLA contracts, private VPCs, compliance documentation, and dedicated account engineering. For teams where the operational overhead of self-managed GPU infrastructure is genuinely a hard cost, Baseten's pricing can be defensible.
What Baseten does well
- Production SLA contracts for enterprise image generation APIs
- Private VPC deployments for data residency requirements
- TensorRT-LLM optimization for LLM serving alongside image generation
- Managed scaling and observability without infrastructure management
Where it falls short
- Most expensive GPU rate in this comparison
- Truss adds another abstraction layer to maintain
- Not price-competitive for teams comfortable running their own inference stack
Best for
Enterprise teams needing SLA contracts, compliance documentation, and managed production APIs for image generation at scale, where raw cost is secondary to operational support.
7. BentoCloud: Python ML Serving with Container Flexibility
Serverless | Per-second billing | Custom Bentos | NVIDIA GPU support | Scale-to-zero
BentoCloud is a managed serving platform built around BentoML, an open-source ML serving framework. You package your inference code as a Bento (a standard container format), push it to BentoCloud, and deploy it as a serverless endpoint. FLUX and diffusion models work through custom BentoML services rather than a hosted model registry.
Compared to Fal.ai, BentoCloud gives more control over the inference code and the model loading process. Compared to Modal, BentoML is more ML-workflow-aware with built-in support for model management, adaptive batching, and multi-model serving pipelines.
What BentoCloud does well
- Flexible Python ML serving with adaptive batching built in
- BentoML open-source framework means no code lock-in to the managed layer
- ONNX and custom runtime support alongside PyTorch
- Multi-model pipeline serving in one endpoint
Where it falls short
- Less out-of-the-box FLUX or Wan model support compared to Fal.ai
- Serverless cold starts on large model containers
- Smaller community and less documentation for image generation specifically
Best for
Teams already using BentoML for other ML serving who want to extend to image generation without switching frameworks.
8. Self-Hosted ComfyUI: Full Control at Infrastructure Cost
Your GPU hourly rate only | Any checkpoint | Any sampler | Full LoRA stacking | No API markup
Deploying ComfyUI on a cloud GPU is not really an "alternative" in the same sense as the others: it is a self-managed inference server on top of any GPU rental (Spheron, RunPod, Lambda, etc.). The platform cost is whatever GPU you provision; there is no additional inference API markup.
Self-hosted ComfyUI gives you the maximum flexibility: load any FLUX.2, Wan, SD3, or HunyuanVideo checkpoint, chain multiple LoRAs, use any sampler and scheduler, and build complex node-based workflows that Fal.ai's API cannot express. The cost is operational overhead: you manage the ComfyUI process, model weights, updates, and uptime.
What self-hosted ComfyUI does well
- Zero API markup: you pay only for GPU time
- Full access to the checkpoint and LoRA ecosystem
- Complex multi-stage workflows impossible in any managed API
- Deterministic outputs with the same seed and settings
Where it falls short
- Requires ongoing infrastructure management: updates, monitoring, restarts
- No built-in scaling or load balancing across multiple GPUs
- VRAM OOM errors require debugging at the container level
Best for
Teams that need the full ComfyUI workflow power and are comfortable managing GPU infrastructure, or anyone running complex LoRA stacking and multi-stage diffusion pipelines.
9. HuggingFace Inference Endpoints: Managed GPU for Hub Models
H100-class: $4.00-8.00/hr | Dedicated GPU endpoints | HuggingFace Hub integration | Pause/resume
HuggingFace Inference Endpoints lets you deploy any Hub model on a dedicated GPU endpoint without writing infrastructure code. FLUX.2, Stable Diffusion 3, and Wan models available on the Hub can be deployed with a few clicks. Billing is per hour while the endpoint is running, with a pause option that stops billing when idle.
For teams using Fal.ai to access HuggingFace-hosted models, this eliminates the per-generation overhead while staying in the HF ecosystem. The limitation is that you are constrained to HF-compatible model formats and frameworks: arbitrary ComfyUI workflows or custom Python inference code outside the HF pipeline require wrapping.
What HuggingFace Inference Endpoints does well
- Native HF Hub integration with minimal configuration
- Pause/resume to avoid idle billing
- Supports text, image, video, and multimodal models from the Hub
- Managed scaling and health monitoring
Where it falls short
- Higher per-hour cost than bare-metal options for equivalent GPU class
- Limited to HF-compatible model formats; complex ComfyUI workflows do not translate directly
- Less flexible for custom sampler configurations or checkpoint blending
Best for
Teams already in the HuggingFace ecosystem who want managed GPU serving for Hub models without managing containers or GPU infrastructure.
10. Fireworks AI: Low-Latency LLM Serverless
Llama 3.1 8B: $0.20/1M tokens | DeepSeek V3: $0.56 input / $1.68 output per 1M | OpenAI-compatible
Like Together AI, Fireworks AI is an LLM inference platform. It is not a substitute for Fal.ai for image or video generation: they do not host FLUX, SD3, or Wan models. Where Fireworks excels is LLM inference at low to moderate token volumes with competitive per-token rates and fast time-to-first-token.
If you are using Fal.ai for both image generation and text inference, Fireworks can cover the LLM workloads more efficiently while you move image generation to a purpose-built alternative.
What Fireworks AI does well
- Competitive per-token rates, often below Together AI for the same models
- Fast time-to-first-token on popular open-weight LLMs
- Fine-tuned LoRA adapter hosting for custom text model serving
- OpenAI-compatible API with function calling
Where it falls short
- No image generation, no video generation, no diffusion models
- Per-token costs at high LLM volumes exceed dedicated GPU rates
- Not relevant for any image/video/diffusion use case
Best for
Teams migrating LLM inference workloads away from Fal.ai; not an image or video generation substitute.
Cost Comparison: Per-Image FLUX.2-dev Generation
These figures use H100 PCIe at $2.01/hr with ~7 images/min FP8 throughput (estimated from community benchmarks for FLUX.2-dev 32B at FP8; actual throughput may vary). Fal.ai and Replicate rates are estimated from public pricing docs as of 05 May 2026. RunPod Serverless reflects a self-hosted ComfyUI container on RunPod.
| Platform | Cost per image | 1K images/day | 10K images/day | 50K images/day |
|---|---|---|---|---|
| Fal.ai (FLUX.2-dev) | ~$0.012 est. | ~$12 | ~$120 | ~$600 |
| Replicate (FLUX.2-dev) | ~$0.092 | ~$92 | ~$920 | ~$4,600 |
| RunPod Serverless (custom container) | ~$0.050 est. | ~$50 | ~$500 | ~$2,500 |
| Spheron H100 PCIe (self-hosted) | ~$0.0048 | ~$4.79 | ~$47.90 | ~$239.30 |
Pricing fluctuates based on GPU availability. The prices above are based on 05 May 2026 and may have changed. Check current GPU pricing for live rates.
Spheron numbers assume per-minute billing on active GPU time only: 1K images takes 143 minutes, 10K images takes 1,429 minutes, and 50K images takes 7,143 minutes (119 GPU-hours).
At 10K images/day the gap between Fal.ai and Spheron is ~$72/day, or roughly $2,160/month. At that volume the infrastructure management cost of self-hosting is a small fraction of the savings.
Cost Comparison: Per-Second Wan 2.5 Video Generation
H100 PCIe generates a 5-second 720p clip in approximately 10-12 minutes (11 minutes used here). Cost per second of output video on Spheron: $2.01/hr / 60 × 11 min per clip / 5 sec per clip = $0.074/sec. Fal.ai rate is from their published pricing page ($0.05/sec for Wan models as of 05 May 2026). See Deploy Wan 2.5 on GPU Cloud for detailed Spheron benchmark data.
| Platform | Cost per second of video | 100 sec/day | 1K sec/day | 5K sec/day |
|---|---|---|---|---|
| Fal.ai (Wan model) | ~$0.050 est. | ~$5 | ~$50 | ~$250 |
| Replicate (Wan model) | ~$0.16 est. | ~$16 | ~$160 | ~$800 |
| RunPod Serverless (custom container) | ~$0.10 est. | ~$10 | ~$100 | ~$500 |
| Spheron H100 PCIe (self-hosted) | ~$0.074 | ~$7.40 | ~$74 | ~$370 |
Pricing fluctuates based on GPU availability. The prices above are based on 05 May 2026 and may have changed. Check current GPU pricing for live rates.
At 5K seconds of video per day (1,000 five-second clips), Fal.ai's per-second billing at $250 undercuts Spheron's $370 on raw per-unit cost. The advantage of self-hosting Wan 2.5 is model control and configuration access, not per-clip price. For broader image-to-video platform comparisons including LTX-Video and HunyuanVideo, see the image-to-video GPU cloud guide.
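The per-second figure, and the generation time at which self-hosting would match Fal.ai on price, can be worked out with a small sketch. It uses only the numbers quoted in this section ($2.01/hr, 11 minutes per 5-second clip, $0.05 per output second); the function names are illustrative.

```python
def cost_per_output_second(gpu_price_per_hour, minutes_per_clip, clip_seconds):
    """Self-hosted cost per second of generated video."""
    clip_cost = gpu_price_per_hour / 60.0 * minutes_per_clip
    return clip_cost / clip_seconds


def break_even_minutes_per_clip(api_price_per_second, gpu_price_per_hour, clip_seconds):
    """Generation time per clip at which self-hosting matches the API price."""
    api_clip_cost = api_price_per_second * clip_seconds
    gpu_price_per_minute = gpu_price_per_hour / 60.0
    return api_clip_cost / gpu_price_per_minute


if __name__ == "__main__":
    print(f"${cost_per_output_second(2.01, 11, 5):.3f} per output second")
    print(f"{break_even_minutes_per_clip(0.05, 2.01, 5):.1f} min per clip to break even")
```

Under these assumptions, a 5-second clip would need to render in roughly 7.5 minutes instead of 11 before the H100 PCIe matched Fal.ai's per-second price. Faster hardware or a lighter configuration could change that, which is why the figure is best treated as a planning estimate.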
When Fal.ai Still Wins
Fal.ai is the right choice in specific situations:
- Low-volume burst generation: Under ~50 images per day or ~20 clips per day, the per-output cost premium is offset by zero infrastructure overhead. You pay nothing when idle, and you never provision or maintain anything.
- Zero ops requirement: If your team has no capacity to manage GPU infrastructure, Fal.ai lets you ship an image generation feature in an afternoon. No Docker, no CUDA, no SSH.
- Prototyping and demos: Quick experiments with FLUX.2, Wan, or ControlNet variants without provisioning hardware. Fal.ai's model catalog is ready immediately.
- Fal.ai-specific model fine-tunes: Some models in Fal.ai's catalog are fine-tuned or optimized by their team and are not available as open weights. If your workflow depends on one of these, there is no direct migration path.
The common thread: low volume, high ops-cost-sensitivity, or dependency on catalog-specific models.
When Self-Hosting Wins
The economics shift at scale or when you need model-level control:
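- Above ~200 images/day: At this volume, per-minute billing on a dedicated H100 is clearly cheaper than per-output API billing, even accounting for the time to manage the instance. For video, per-clip cost still favors Fal.ai at the rates above, so the case for self-hosting rests on model control rather than price.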
- Custom LoRA chains: Loading multiple LoRAs, custom embeddings, or checkpoint blends requires direct file system access to the model directory. No managed API supports this.
- Custom samplers and schedulers: DDPM, DPM++ 2M Karras, custom noise schedules, custom step counts outside typical ranges: these are ComfyUI configuration options, not API parameters.
- HIPAA or data residency requirements: With self-hosting, your prompts and generated images never leave your instance. For enterprise use cases with sensitive content or regulated data, that matters.
- Deterministic outputs: Same seed, same settings, same output. Managed APIs sometimes introduce non-determinism through infrastructure-level batching or version updates.
Migration Guide: Porting a Fal.ai Workflow to ComfyUI on Spheron
Step 1: Identify your checkpoint and LoRA
Look at what model you are calling on Fal.ai. FLUX.2-dev is a distinct model family from FLUX.1-dev, published by Black Forest Labs at black-forest-labs/FLUX.2-dev on HuggingFace (FLUX.2-dev, FLUX.2-klein-9B, and FLUX.2-klein-4B are separate model repositories; do not use FLUX.1-dev weights for FLUX.2 workflows as they produce different outputs). For VRAM and container requirements specific to FLUX.2-dev, see the Deploy FLUX.2 on GPU Cloud guide. For Wan video, the equivalent open-source weights are Wan-AI/Wan2.2-T2V-A14B on HuggingFace. Note the LoRA URL you pass to Fal.ai's API if any.
Step 2: Provision a GPU instance on Spheron
Go to app.spheron.ai. For FLUX.2 at 1024x1024, an H100 PCIe (80GB, $2.01/hr) or L40S (48GB, lower rate) works well. For Wan 2.2 14B at 720p, H100 PCIe is the minimum. Choose Ubuntu 22.04. SSH in when provisioning completes, typically under 2 minutes.
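As a rough sizing aid for the GPU choice, the sketch below estimates weight memory from parameter count and precision. It is a weights-only approximation for the 32B FLUX.2-dev transformer mentioned earlier: it ignores the text encoder, VAE, activations, and framework overhead, and the 20% headroom figure is an arbitrary assumption, not a measured requirement.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp8": 1, "fp16": 2, "bf16": 2, "fp32": 4}


def weight_memory_gb(params_billion, dtype):
    """Weights-only memory in decimal GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def weights_fit(params_billion, dtype, vram_gb, headroom_fraction=0.2):
    """True if the weights fit while leaving an assumed headroom fraction free."""
    usable_gb = vram_gb * (1.0 - headroom_fraction)
    return weight_memory_gb(params_billion, dtype) <= usable_gb


if __name__ == "__main__":
    for dtype in ("fp8", "bf16"):
        print(dtype, weight_memory_gb(32, dtype), "GB",
              "| fits 48GB:", weights_fit(32, dtype, 48))
```

By this estimate the FP8 weights (about 32 GB) leave room on a 48GB L40S, while BF16 weights (about 64 GB) do not, which is consistent with running FLUX.2-dev at FP8 on the smaller card. Actual usage should be confirmed on the instance before committing to a GPU type.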
Step 3: Pull ComfyUI and download model weights
```bash
# Pull ComfyUI Docker image
IMAGE=ghcr.io/ai-dock/comfyui:latest-cuda
docker pull "$IMAGE"

# Create the host directories first so the mounts and the LoRA download path exist
mkdir -p ~/comfyui-models/loras ~/comfyui-output

docker run -d \
  --name comfyui \
  --gpus all \
  --ipc=host \
  -p 127.0.0.1:8188:8188 \
  -v ~/comfyui-models:/opt/ComfyUI/models \
  -v ~/comfyui-output:/opt/ComfyUI/output \
  "$IMAGE"

# Download FLUX.2-dev weights (gated repo - accept FLUX Non-Commercial License on HuggingFace first)
pip install huggingface_hub
huggingface-cli login  # provide your HF token
huggingface-cli download black-forest-labs/FLUX.2-dev \
  --local-dir ~/comfyui-models/flux2-dev

# Download your LoRA file if applicable
wget -O ~/comfyui-models/loras/your-lora.safetensors YOUR_LORA_URL
```
Step 4: Build the equivalent workflow in ComfyUI
Open ComfyUI via SSH tunnel (ssh -L 8188:localhost:8188 user@your-server-ip, then navigate to http://localhost:8188). Load a FLUX.2 base workflow from comfyworkflows.com. Map your Fal.ai API parameters to ComfyUI node settings:
| Fal.ai API parameter | ComfyUI node setting |
|---|---|
| prompt | CLIPTextEncode (positive) node |
| negative_prompt | CLIPTextEncode (negative) node |
| num_inference_steps | KSampler steps |
| guidance_scale | KSampler cfg |
| image_size | EmptyLatentImage width/height |
| loras[0].path | LoRA Loader node, model file path |
| seed | KSampler seed |
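The mapping above can be applied programmatically to a workflow exported in API format. The sketch below is one possible approach under stated assumptions: the node IDs in NODE_IDS are hypothetical placeholders (yours will differ), image_size is assumed to arrive as a width/height object, and the LoRA is referenced by the local file name it was saved under, since ComfyUI's LoRA loader takes a file name, not a URL.

```python
import copy

# Hypothetical node IDs for illustration only; read the real ones from your
# own workflow exported in API format.
NODE_IDS = {"positive": "6", "negative": "7", "sampler": "3", "latent": "5", "lora": "10"}


def apply_fal_params(workflow, params, node_ids=NODE_IDS, lora_name=None):
    """Return a copy of `workflow` with Fal.ai-style `params` applied."""
    wf = copy.deepcopy(workflow)  # leave the caller's template untouched

    def set_input(node_key, input_name, value):
        wf[node_ids[node_key]]["inputs"][input_name] = value

    if "prompt" in params:
        set_input("positive", "text", params["prompt"])
    if "negative_prompt" in params:
        set_input("negative", "text", params["negative_prompt"])
    if "num_inference_steps" in params:
        set_input("sampler", "steps", params["num_inference_steps"])
    if "guidance_scale" in params:
        set_input("sampler", "cfg", params["guidance_scale"])
    if "seed" in params:
        set_input("sampler", "seed", params["seed"])

    # Assumes image_size is an object like {"width": 1024, "height": 1024}
    size = params.get("image_size")
    if isinstance(size, dict):
        set_input("latent", "width", size["width"])
        set_input("latent", "height", size["height"])

    # Local file name under models/loras, not the URL passed to Fal.ai
    if lora_name is not None:
        set_input("lora", "lora_name", lora_name)
    return wf
```

Working on a deep copy keeps the exported template reusable across requests, which matters once several generations share one workflow file.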
Step 5: Call the ComfyUI API from your application
Replace your Fal.ai client calls with ComfyUI's HTTP API:
```python
import time

import requests


def generate_image(server_address, workflow_json, prompt_text):
    # Inject your prompt into the workflow
    workflow_json["6"]["inputs"]["text"] = prompt_text  # CLIPTextEncode node ID

    # Submit the generation job
    response = requests.post(
        f"http://{server_address}:8188/prompt",
        json={"prompt": workflow_json},
        timeout=30
    )
    response.raise_for_status()
    prompt_id = response.json()["prompt_id"]

    # Poll for completion (up to max_polls attempts; each poll can take up to ~11 s)
    max_polls = 300
    for _ in range(max_polls):
        try:
            history = requests.get(
                f"http://{server_address}:8188/history/{prompt_id}",
                timeout=10
            ).json()
        except (requests.RequestException, ValueError):
            time.sleep(1)
            continue
        if prompt_id in history:
            outputs = history[prompt_id]["outputs"]
            return outputs
        time.sleep(1)
    raise TimeoutError(f"Generation did not complete within {max_polls} polls")
```
The node IDs ("6" above) come from your exported ComfyUI workflow JSON. Export any working workflow via the ComfyUI UI "Export (API format)" button to get the exact structure. For the full ComfyUI setup guide including GPU-specific configuration and workflow management, see ComfyUI on GPU Cloud 2026.
Step 6 (optional): Add a lightweight API wrapper
If your existing application expects Fal.ai's response format, add a Flask or FastAPI wrapper around the ComfyUI API call that accepts the same input shape and returns the image in the same format. Your application code then requires no changes beyond the endpoint URL.
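The core of such a wrapper is a translation from ComfyUI's history outputs to the response shape your application already parses. The sketch below assumes the application expects a Fal.ai-like body of the form {"images": [{"url": ...}]}, which you should check against your own client code; it builds URLs against ComfyUI's /view endpoint. The function name and the base_url argument are illustrative.

```python
from urllib.parse import urlencode


def to_fal_style_response(outputs, base_url, seed=None):
    """Convert ComfyUI history outputs into an assumed Fal.ai-like response dict."""
    images = []
    for node_output in outputs.values():
        # Only image-producing nodes (e.g. SaveImage) carry an "images" list
        for img in node_output.get("images", []):
            query = urlencode({
                "filename": img["filename"],
                "subfolder": img.get("subfolder", ""),
                "type": img.get("type", "output"),
            })
            images.append({"url": f"{base_url}/view?{query}"})

    response = {"images": images}
    if seed is not None:
        response["seed"] = seed
    return response
```

A Flask or FastAPI route would then call the generation function, pass its outputs through this translation, and return the result as JSON. Because the ComfyUI port is bound to localhost in Step 3, the wrapper would also need to proxy or re-host the image bytes if clients cannot reach that address directly.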
Teams running more than a few hundred FLUX.2 generations per day find self-hosting on bare metal ~2.5x cheaper than per-output API billing. For video workflows, the advantage is model-level access to Wan configurations that a managed API can't expose. H100 on Spheron starts at $2.01/hr with per-minute billing and no minimum commitment.
