Tutorial

Deploy Ideogram 4 on GPU Cloud: Self-Host the Open-Weight Diffusion Transformer for Text-Accurate Image Generation (2026)


Ideogram 4 ships its first open weights: a 9.3B parameter single-stream Diffusion Transformer trained from scratch, with a Qwen3-VL-8B vision-language text encoder and a flow-matching sampler. Teams that previously paid $0.10 per image via the Ideogram API can now self-host. If you're choosing hardware for diffusion model inference more broadly, the GPU selection guide for AI image generation covers the full GPU decision matrix from RTX 4090 to H100, which applies directly to Ideogram 4 as well.

What Is Ideogram 4 and Why the Open Weights Matter

Ideogram built its reputation as a consumer text-to-image product where typography actually works. Generating an image with the words "Sale 50% Off" rendered legibly, in the right font weight, centered on a banner, was essentially broken in SDXL, Stable Diffusion 1.5, and early Flux. Ideogram made this a core competency. The API has been the primary way to access that quality.

Ideogram 4, launched June 3, 2026, is the first version to ship with open weights. The model is a 9.3B parameter Diffusion Transformer with 34 layers and an embedding dimension of 4,608. It is not a fine-tune of FLUX or any existing architecture. Ideogram trained it from scratch using flow-matching rather than the DDPM sampling used by Stable Diffusion. The text encoder is Qwen3-VL-8B-Instruct, with hidden states extracted from 13 intermediate layers concatenated and fed into the DiT backbone.

Flow-matching matters for cost. DDPM-based models need 50-100 denoising steps for quality results. Flow-matching learns a straighter trajectory in latent space, so the sampler converges in fewer steps. The highest-quality sampler preset (V4_QUALITY_48) uses 48 steps; for most production use cases, 25-30 steps gives a good quality-speed tradeoff. Fewer steps per image means more images per GPU-hour.
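The step-count argument can be illustrated with a toy ODE. The sketch below is not Ideogram's sampler. It integrates two hand-written 2D velocity fields (both hypothetical) with plain Euler steps, to show that a straight trajectory is recovered accurately with very few steps while a curved one needs more steps for the same accuracy.

```python
import math

def euler_integrate(velocity, start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, y = start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        vx, vy = velocity((x, y), i * dt)
        x, y = x + dt * vx, y + dt * vy
    return x, y

def straight_field(point, t):
    # Constant velocity: a straight line from (1, 0) to (0, 1).
    return (-1.0, 1.0)

def curved_field(point, t):
    # Quarter-circle rotation from (1, 0) to (0, 1): the direction changes
    # continuously, so each Euler step drifts off the true path.
    x, y = point
    omega = math.pi / 2
    return (-omega * y, omega * x)

def endpoint_error(velocity, num_steps, start=(1.0, 0.0), target=(0.0, 1.0)):
    """Distance between the integrated endpoint and the true endpoint."""
    x, y = euler_integrate(velocity, start, num_steps)
    return math.hypot(x - target[0], y - target[1])

if __name__ == "__main__":
    for steps in (4, 25, 48):
        print(steps,
              endpoint_error(straight_field, steps),
              endpoint_error(curved_field, steps))
```

On the straight field the error is negligible even at 4 steps; on the curved field it shrinks only as the step count grows. Real flow-matching trajectories are not perfectly straight, which is why the model still uses tens of steps rather than one.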

The typographic advantage comes from architecture, not post-training tricks. In a single-stream DiT, text tokens and image tokens attend to each other at every transformer layer, so at each of the 34 layers the model learns how text content and spatial layout relate. Cross-attention models such as SDXL encode the text separately and inject it into the image backbone through cross-attention layers at specific points. The unified DiT approach produces more precise glyph placement and consistent letterform shapes, which is why Ideogram 4 ranks first in quality mode, and first among open-weight models, on the Artificial Analysis Image Arena.

License: Ideogram 4 ships under the Ideogram 4 Non-Commercial license. Verify terms on the HuggingFace model page before any commercial deployment. The weights are gated, requiring you to accept the license on HuggingFace before downloading.

VRAM Requirements for the 9.3B DiT

| Precision | VRAM | Quality Trade-off | Minimum GPU |
|---|---|---|---|
| FP16 | ~20-22GB | None (reference) | A100 40G SXM4, A100 PCIe 80G, L40S 48G |
| BF16 | ~20-22GB | None | Same as FP16 |
| FP8 (Ideogram native) | ~12-15GB | Minimal | A100 40G and above |
| NF4 (Ideogram native) | ~8-10GB | Low | RTX 4090 (24GB), most 12GB+ GPUs |

The nf4 variant from ideogram-ai/ideogram-4-nf4 fits in a single 24GB GPU. For production batch inference with batch=4, use a 40GB or 80GB GPU to leave headroom for activations. The fp8 variant (ideogram-ai/ideogram-4-fp8) gives better throughput on H100 hardware, which has native FP8 Tensor Cores.

For comparison, see the FLUX.2 deployment guide. FLUX.2-dev is 32B parameters and needs FP8 (~32GB) on an 80GB GPU just to fit. Ideogram 4's 9.3B footprint at nf4 fits hardware that FLUX.2-dev cannot run, and in FP16 it fits 40GB GPUs that FLUX.2-dev cannot use even at FP8.
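As a rough cross-check on these numbers, weight storage alone is parameter count times bytes per parameter. The sketch below covers weights only; the helper name and the precision map are illustrative. The VRAM table's figures are higher, presumably because they include activations, quantization metadata, and other runtime overhead (that breakdown is an assumption, not a figure from Ideogram).

```python
# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"fp16": 16, "bf16": 16, "fp8": 8, "nf4": 4}

def weight_memory_gb(params_billions, precision):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BITS_PER_PARAM[precision] / 8

if __name__ == "__main__":
    for name, params in (("Ideogram 4 DiT", 9.3), ("FLUX.2-dev", 32.0)):
        for precision in ("fp16", "fp8", "nf4"):
            gb = weight_memory_gb(params, precision)
            print(f"{name:15s} {precision}: ~{gb:.1f} GB")
```

For the 9.3B backbone this gives about 18.6 GB at FP16, and 32 GB for FLUX.2-dev at FP8, in line with the figures quoted in this section.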

GPU Options for Ideogram 4 on Spheron

Pricing below is from Spheron's live marketplace as of 22 Jun 2026. GPU pricing fluctuates based on availability. Check current GPU pricing for live rates.

| GPU | VRAM | On-Demand $/hr | Spot $/hr | Best Ideogram 4 Use Case |
|---|---|---|---|---|
| H100 SXM5 | 80GB HBM3 | $3.98 | $2.91 | FP8, highest throughput, interactive endpoints |
| H100 PCIe | 80GB HBM2e | $2.01 | N/A | FP8, production workhorse |
| A100 SXM4 80G | 80GB HBM2 | $1.69 | $0.82 | FP8/FP16, batch dataset generation, spot for async jobs |
| A100 PCIe 80G | 80GB HBM2 | $1.48 | $1.19† | FP8/FP16, cost-focused on-demand |
| A100 40G SXM4 | 40GB HBM2 | $1.57 | N/A | FP16 or FP8, single-GPU development |

†A100 PCIe 80G spot at $1.19/hr is cheaper than on-demand at $1.48/hr for this GPU.

H100 SXM5 HBM3 bandwidth (3,350 GB/s) gives a measurable throughput advantage over A100 HBM2 (2,000 GB/s) for memory-bandwidth-bound diffusion inference. For interactive user-facing endpoints where latency matters, the H100 SXM5 is the right pick. For batch dataset generation at volume where preemption is acceptable, A100 SXM4 spot at $0.82/hr cuts cost sharply versus H100 on-demand.

Step-by-Step Deployment Guide

Step 1: Launch a Spheron GPU instance. For production inference at batch=4 using the nf4 diffusers path, A100 SXM4 80G ($1.69/hr on-demand, $0.82/hr spot) or H100 PCIe ($2.01/hr) are the recommended starting points. A100 SXM4 80G has 80GB HBM2 and fits nf4 Ideogram 4 comfortably at batch=4 with substantial headroom. For development or single-image nf4 inference, A100 40G SXM4 ($1.57/hr) or even an RTX 4090 works. Select Ubuntu 22.04. For a full account setup walkthrough, see the Spheron getting started guide.

Step 2: Pull a PyTorch container and install dependencies:

bash
docker pull pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
docker run --gpus all --ipc=host -it pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime bash

# Ideogram4Pipeline is in the diffusers development branch as of Jun 2026
pip install "git+https://github.com/huggingface/diffusers.git"
pip install transformers accelerate sentencepiece huggingface_hub

Step 3: Download model weights from HuggingFace. The weights are gated, so you must accept the license on the model page before downloading.

bash
# Accept the license at huggingface.co/ideogram-ai/ideogram-4-nf4 first
huggingface-cli login  # paste your HF token when prompted

# For diffusers-based deployment (nf4, works on 24GB+ GPU):
huggingface-cli download ideogram-ai/ideogram-4-nf4 \
  --local-dir ~/ideogram-4-weights-nf4

# Note: fp8 variant does NOT support Ideogram4Pipeline from diffusers.
# It requires Ideogram's own native runtime and cannot be loaded with from_pretrained.
# huggingface-cli download ideogram-ai/ideogram-4-fp8 \
#   --local-dir ~/ideogram-4-weights-fp8

Verify the model page at huggingface.co/ideogram-ai/ideogram-4-nf4 is public and that you have accepted the license gate before running the download command.

Step 4: Run inference with the diffusers pipeline:

python
import torch
from diffusers import Ideogram4Pipeline

# Use ideogram-ai/ideogram-4-nf4 with diffusers (fits 24GB+ GPUs).
# The fp8 variant does NOT support Ideogram4Pipeline from diffusers;
# fp8 requires Ideogram's native runtime, not from_pretrained.
pipe = Ideogram4Pipeline.from_pretrained(
    "ideogram-ai/ideogram-4-nf4",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Optional: torch.compile for sustained batch throughput
# First call takes 3-5 minutes to compile; subsequent calls are 30-50% faster.
# Only worth enabling for long-running processes doing many thousands of images.
# pipe.transformer = torch.compile(
#     pipe.transformer,
#     mode="reduce-overhead",
#     fullgraph=True,
# )

image = pipe(
    prompt="A coffee mug with the text 'Monday' printed in bold serif on white ceramic",
    height=1024,
    width=1024,
    num_inference_steps=25,  # 25 works well; use 48 for highest quality (V4_QUALITY_48)
).images[0]

image.save("output.png")

The nf4 variant includes Ideogram's own quantization and works with the diffusers Ideogram4Pipeline. The fp8 variant requires Ideogram's native runtime rather than diffusers and cannot be loaded with from_pretrained. For 40GB+ GPUs, nf4 still gives strong throughput while fitting comfortably within VRAM limits.

Step 5: Access via SSH tunnel. Do not bind inference ports to public interfaces:

bash
ssh -L 8000:localhost:8000 user@your-server-ip

Then point your requests at http://localhost:8000.

FastAPI Inference Server

For production APIs, load the pipeline once at startup and reuse it across requests:

python
import io
import base64
import asyncio
import threading
import torch
from concurrent.futures import ThreadPoolExecutor
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from diffusers import Ideogram4Pipeline

app = FastAPI()
pipe = None
lock = threading.Lock()
executor = ThreadPoolExecutor(max_workers=1)

class GenerateRequest(BaseModel):
    prompt: str
    width: int = 1024
    height: int = 1024
    steps: int = 25

@app.on_event("startup")
async def load_model():
    global pipe
    # Use nf4 with diffusers; fp8 requires Ideogram's native runtime (not diffusers)
    pipe = Ideogram4Pipeline.from_pretrained(
        "ideogram-ai/ideogram-4-nf4",
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    # Uncomment for sustained high-throughput workloads (3-5 min one-time warmup):
    # pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)

@app.post("/generate")
async def generate(req: GenerateRequest):
    if pipe is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    loop = asyncio.get_running_loop()
    def _infer():
        with lock:
            return pipe(
                req.prompt,
                height=req.height,
                width=req.width,
                num_inference_steps=req.steps,
            ).images[0]
    image = await loop.run_in_executor(executor, _infer)
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image": base64.b64encode(buf.getvalue()).decode()}

Install the server dependencies (pip install fastapi uvicorn), save the code above as app.py, and run with:

bash
uvicorn app:app --host 127.0.0.1 --port 8000 --workers 1

Keep --workers 1 for single-GPU serving. The threading.Lock() inside the thread-pool executor serializes GPU calls within that worker while keeping the event loop free to handle I/O. For multi-GPU setups, run one worker process per GPU and put nginx in front to load-balance across ports.
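A client for this server needs only the standard library. The sketch below assumes the server is reachable at http://localhost:8000 (for example through the SSH tunnel from Step 5) and returns the {"image": "<base64 PNG>"} JSON shown above; the helper names are illustrative.

```python
import base64
import json
import urllib.request

def build_generate_request(prompt, width=1024, height=1024, steps=25,
                           base_url="http://localhost:8000"):
    """Build a POST request whose body matches the GenerateRequest model."""
    payload = {"prompt": prompt, "width": width, "height": height, "steps": steps}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def decode_image(response_body):
    """Extract raw PNG bytes from the server's JSON response."""
    return base64.b64decode(json.loads(response_body)["image"])

def generate_to_file(prompt, path="output.png", timeout=300):
    """Send one generation request and write the returned PNG to disk."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        png_bytes = decode_image(response.read())
    with open(path, "wb") as f:
        f.write(png_bytes)
    return path

# Example call (requires the server to be running):
# generate_to_file("A storefront sign that reads 'Open Late'")
```

The generous timeout is deliberate: requests queue behind the lock, so a call can wait for earlier generations to finish before its own starts.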

Throughput, Batching, and Cost Per Image

All values are approximate. Configuration: Ubuntu 22.04, CUDA 12.4, diffusers development branch, 1024x1024 output, 25 steps.

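| GPU | Precision | Images/min (batch=1) | Images/min (batch=4) | $/100 images (on-demand) | $/100 images (spot) |
|---|---|---|---|---|---|
| H100 SXM5 | FP8 | ~50 | ~135 | ~$0.13 | ~$0.10 |
| H100 PCIe | FP8 | ~40 | ~105 | ~$0.08 | N/A |
| A100 SXM4 80G | FP8 | ~22 | ~58 | ~$0.13 | ~$0.06 |
| A100 PCIe 80G | FP8 | ~20 | ~52 | ~$0.12 | ~$0.10† |
| A100 40G SXM4 | FP16 | ~18 | ~45 | ~$0.15 | N/A |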

†A100 PCIe 80G spot at $1.19/hr is cheaper than on-demand at $1.48/hr, so spot is the better option for this GPU.

Pricing fluctuates based on GPU availability. The prices above are based on 22 Jun 2026 and may have changed. Check current GPU pricing for live rates.
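The cost columns in the throughput table can be reproduced from the hourly price and the images-per-minute rate. The sketch below uses the batch=1 rates, which is what the published cost figures appear to be based on; the function name is illustrative.

```python
def cost_per_100_images(price_per_hour, images_per_minute):
    """Dollar cost of 100 images at a sustained generation rate."""
    images_per_hour = images_per_minute * 60
    return price_per_hour / images_per_hour * 100

# (label, $/hr, batch=1 images/min) taken from the tables in this article.
ROWS = [
    ("H100 SXM5 on-demand", 3.98, 50),
    ("H100 PCIe on-demand", 2.01, 40),
    ("A100 SXM4 80G on-demand", 1.69, 22),
    ("A100 SXM4 80G spot", 0.82, 22),
]

if __name__ == "__main__":
    for label, price, rate in ROWS:
        print(f"{label:24s} ~${cost_per_100_images(price, rate):.2f} per 100 images")
```

Plugging in the batch=4 rates instead gives lower figures, for example about $0.05 per 100 images for H100 SXM5 on-demand at ~135 images per minute.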

Ideogram 4's 9.3B parameter count means each forward pass moves fewer weights than the 32B FLUX.2-dev. At 25 steps versus FLUX.2's 28 steps, the H100 PCIe produces roughly 3x the image throughput on a per-minute basis. The batch=4 scaling factor is strong on H100 SXM5 (HBM3 bandwidth absorbs the larger attention matrices well) and moderate on A100 (HBM2 saturates faster at larger batch sizes).

torch.compile with mode="reduce-overhead" gives 30-50% throughput gain after a 3-5 minute one-time compilation. For processes that will run thousands of images in a session, the compilation cost is worth it. For short-lived jobs generating a few dozen images, skip it.
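To put a number on "worth it", the break-even point can be estimated by equating total time with and without compilation. This is a back-of-the-envelope sketch that assumes a constant throughput gain after warmup; the function name is illustrative.

```python
def compile_breakeven_images(images_per_minute, compile_minutes, speedup):
    """Images after which a compiled run overtakes an uncompiled one.

    Solves N / r = c + N / (r * (1 + s)) for N, where r is the uncompiled
    rate (images/min), c the compile time (min), and s the fractional
    throughput gain (0.4 means 40% faster).
    """
    return compile_minutes * images_per_minute * (1 + speedup) / speedup

if __name__ == "__main__":
    # H100 SXM5 batch=1 rate from the table, midpoints of the quoted ranges.
    print(round(compile_breakeven_images(50, 4, 0.4)))
```

With the H100 SXM5 batch=1 rate (~50 images per minute), a 4-minute compile, and a 40% gain, the break-even lands at roughly 700 images, which is consistent with skipping compilation for jobs of a few dozen images.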

On-Demand vs Spot for Ideogram 4 Workloads

On-demand H100 PCIe or H100 SXM5 for interactive generation APIs. User-facing endpoints cannot tolerate spot preemption, which interrupts the running process. H100 PCIe runs $2.01/hr; H100 SXM5 runs $3.98/hr on-demand. A single H100 PCIe on-demand serves roughly 2,400 images per hour at batch=1, and H100 SXM5 reaches around 3,000 images per hour. For an internal design tool or a customer-facing creative product, H100 PCIe covers substantial load at lower cost. H100 SXM5 spot at $2.91/hr is an option for non-interactive batch jobs where you want H100-class throughput at a lower price than on-demand.

Spot A100 SXM4 for batch dataset generation. At $0.82/hr and roughly 1,320 images per hour, spot A100 SXM4 runs the lowest cost per image of any GPU on Spheron for Ideogram 4. Batch annotation runs, synthetic training data generation, and overnight rendering pipelines are ideal candidates. Checkpoint your generation queue against a simple file-based index so a spot preemption just resumes from where it stopped.
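A file-based index for resuming after preemption can be as small as an append-only JSONL file. The sketch below is one possible layout, not a prescribed one: the file name, record fields, and generate_fn callback are all hypothetical, and generate_fn is expected to save the image itself and return its path.

```python
import json
import os

def load_done_ids(index_path):
    """Return the set of job IDs already recorded in the index file."""
    done = set()
    if not os.path.exists(index_path):
        return done
    with open(index_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip blank lines or a line truncated by preemption
    return done

def run_resumable(jobs, generate_fn, index_path="done.jsonl"):
    """Process (job_id, prompt) pairs, skipping IDs already in the index."""
    done = load_done_ids(index_path)
    completed = 0
    with open(index_path, "a", encoding="utf-8") as index:
        for job_id, prompt in jobs:
            if job_id in done:
                continue
            output_path = generate_fn(job_id, prompt)
            # Record the job only after the image is saved, then force the
            # entry to disk so it survives a sudden preemption.
            index.write(json.dumps({"id": job_id, "output": output_path}) + "\n")
            index.flush()
            os.fsync(index.fileno())
            completed += 1
    return completed
```

Keep the index and outputs on storage that outlives the instance; otherwise a preempted instance takes the checkpoint with it.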

Ideogram API vs self-hosting breakeven. The Ideogram 4.0 Quality API costs approximately $0.10 per image. Self-hosting on A100 SXM4 spot ($0.82/hr) at ~1,320 images/hr puts per-image cost at about $0.0006. For teams generating several hundred images per week or more, self-hosting pays off quickly. The setup cost is front-loaded (one-time for the instance, dependencies, and pipeline configuration). After that, per-image cost drops sharply compared to the API.
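The comparison is simple enough to script for your own volumes. The sketch below uses the article's approximate figures as constants and counts GPU rental only; setup_gpu_hours is a hypothetical knob for one-time overhead, and engineering time is not modeled.

```python
API_PRICE_PER_IMAGE = 0.10   # approximate Ideogram API price quoted above
SPOT_PRICE_PER_HOUR = 0.82   # A100 SXM4 spot
IMAGES_PER_HOUR = 1320       # ~22 images/min at batch=1

def api_cost(num_images):
    """Cost of generating num_images through the hosted API."""
    return num_images * API_PRICE_PER_IMAGE

def self_host_cost(num_images, setup_gpu_hours=0.0):
    """GPU rental cost for num_images, plus optional one-time GPU hours."""
    gpu_hours = num_images / IMAGES_PER_HOUR + setup_gpu_hours
    return gpu_hours * SPOT_PRICE_PER_HOUR

if __name__ == "__main__":
    for n in (100, 500, 5000):
        print(n, round(api_cost(n), 2), round(self_host_cost(n, setup_gpu_hours=2), 2))
```

At 500 images the API comes to about $50, against roughly $0.31 of spot GPU time, or about $1.95 if two GPU-hours of setup are charged to the same batch.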

For teams running interactive generation where latency is the constraint (response times of a second or two), the per-image cost difference is secondary to inference speed. H100 SXM5 at batch=1 (~50 images per minute, or roughly 1.2 seconds per image) delivers faster individual image generation than the Ideogram API's network round-trip for typical production deployments.

Production Patterns: API Wrapping and Queue Architecture

Single-GPU with thread-pool executor: The FastAPI server above with threading.Lock() and ThreadPoolExecutor(max_workers=1) is the right pattern for moderate load. One GPU, one worker process, requests queued by the lock while the event loop stays free for I/O. This handles 20-50 images per minute depending on GPU. If your peak load exceeds this, queue requests asynchronously or add GPUs.

Multi-GPU horizontal: Run one FastAPI process per GPU, each on a distinct port. Put nginx in front with a round-robin upstream block. Each process owns its GPU exclusively. This scales linearly: two A100s give roughly 2x throughput, two H100s the same. No shared GPU state to manage.
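One way to start a process per GPU is a small launcher that pins each worker with CUDA_VISIBLE_DEVICES. The sketch below assumes the FastAPI server is saved as app.py; the helper names and the port scheme (base port plus GPU index) are illustrative.

```python
import os
import subprocess

def build_worker_specs(num_gpus, base_port=8000, app="app:app"):
    """Return one (env_overrides, argv) pair per GPU, each on its own port."""
    specs = []
    for gpu_index in range(num_gpus):
        env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
        argv = [
            "uvicorn", app,
            "--host", "127.0.0.1",
            "--port", str(base_port + gpu_index),
            "--workers", "1",
        ]
        specs.append((env, argv))
    return specs

def launch_workers(num_gpus, base_port=8000):
    """Start the worker processes; each sees only its own GPU as cuda:0."""
    processes = []
    for env, argv in build_worker_specs(num_gpus, base_port):
        processes.append(subprocess.Popen(argv, env={**os.environ, **env}))
    return processes
```

Because each process sees exactly one device, the server code can keep calling .to("cuda") unchanged. The nginx upstream block would then list 127.0.0.1:8000, 127.0.0.1:8001, and so on.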

Queue-based async with Celery and Redis: For burst loads that exceed single-GPU capacity without needing to provision additional GPUs permanently, use a task queue. The FastAPI endpoint enqueues a generation task and returns a job ID. Celery workers pull tasks and process them with the warmed pipeline. Clients poll for results or use a webhook. This decouples request acceptance from generation time and lets you absorb bursts without dropping requests. The same queue pattern applies to other open-source generation pipelines covered in the guide to deploying AI image editing models on GPU cloud.
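The control flow of that pattern (accept, return a job ID, poll) can be sketched without any infrastructure. The class below is an in-process stand-in built on the standard library, not Celery or Redis; in a real deployment the queue would live in Redis and the worker would be a Celery task. All names are illustrative.

```python
import queue
import threading
import uuid

class JobQueue:
    """In-process stand-in for the enqueue / poll pattern."""

    def __init__(self, generate_fn):
        self._generate_fn = generate_fn
        self._tasks = queue.Queue()
        self._results = {}
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, prompt):
        """Accept a request immediately and return a job ID."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._results[job_id] = {"status": "queued", "result": None}
        self._tasks.put((job_id, prompt))
        return job_id

    def status(self, job_id):
        """Return a copy of the job record, or None for an unknown ID."""
        with self._lock:
            record = self._results.get(job_id)
            return dict(record) if record else None

    def join(self):
        """Block until every queued job has been processed."""
        self._tasks.join()

    def _worker(self):
        # Single consumer, mirroring one worker per GPU.
        while True:
            job_id, prompt = self._tasks.get()
            try:
                outcome = {"status": "done", "result": self._generate_fn(prompt)}
            except Exception as exc:
                outcome = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._results[job_id] = outcome
            self._tasks.task_done()
```

A POST endpoint would call enqueue and return the ID; a GET endpoint would call status. Unlike this sketch, a Redis-backed queue also survives a restart of the API process.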

Summary

For lowest friction: A100 SXM4 80G at $1.69/hr on-demand with nf4 quantization via diffusers. The 80GB HBM2 fits nf4 Ideogram 4 at batch=4 with room to spare. The throughput table lists roughly 58 images per minute at batch=4 for this GPU (an FP8 figure, so treat it as a reference point for nf4), which handles most production generation workloads without needing H100 pricing.

For cost-focused async jobs: A100 SXM4 spot at $0.82/hr is the right call for batch dataset generation, overnight rendering, or volume synthetic data pipelines. The $0.06 per 100 images, computed from the batch=1 rate of ~22 images per minute, is the cheapest compute cost in Spheron's current GPU catalog for this model, and batch=4 lowers it further.

For highest throughput: H100 SXM5 at $3.98/hr on-demand (or $2.91/hr spot for preemptible batch jobs) with torch.compile. At roughly 135 images per minute at batch=4 with compile enabled, it is the right choice for interactive APIs serving high concurrency or pipelines processing thousands of images per hour.


Teams moving off the Ideogram API can run the 9.3B open-weight model on a single 80GB GPU instance. Spot A100s handle batch generation at $0.06 per 100 images; on-demand H100s serve interactive endpoints where preemption is not acceptable.



Quick Setup Guide

  1. Provision a GPU instance on Spheron

    Log in to app.spheron.ai and launch an H100 PCIe (80GB, $2.01/hr) or A100 SXM4 80G (80GB, $1.69/hr on-demand or $0.82/hr spot) with Ubuntu 22.04. For development or low-volume use with nf4 quantization, an A100 40G SXM4 is sufficient. Access via SSH only; do not expose inference ports publicly.

  2. Install dependencies and download weights

    Install Python dependencies: pip install 'git+https://github.com/huggingface/diffusers.git' transformers accelerate sentencepiece. Run huggingface-cli login, accept the license gate at huggingface.co/ideogram-ai/ideogram-4-nf4, then download with huggingface-cli download ideogram-ai/ideogram-4-nf4. Note: the fp8 variant does not support the diffusers Ideogram4Pipeline; use nf4 for diffusers-based deployment.

  3. Run inference with the diffusers pipeline

    Load the model with Ideogram4Pipeline.from_pretrained('ideogram-ai/ideogram-4-nf4', torch_dtype=torch.bfloat16).to('cuda'). Call pipe(prompt, height=1024, width=1024, num_inference_steps=25). The nf4 variant is the correct choice for diffusers; the fp8 variant requires Ideogram's own native runtime and will error with from_pretrained.

  4. Wrap in a FastAPI inference server

    Create a FastAPI app with a POST /generate endpoint. Load Ideogram4Pipeline once at startup. Use a threading.Lock() and ThreadPoolExecutor(max_workers=1) to serialize GPU access without blocking the asyncio event loop. Accept prompt, width, height, and steps in the request body. Return a base64-encoded PNG response.

  5. Tune for sustained throughput and burst traffic

    For sustained throughput gains on long-running processes, add torch.compile to the transformer after loading: pipe.transformer = torch.compile(pipe.transformer, mode='reduce-overhead', fullgraph=True). For burst traffic, add a Celery worker backed by Redis so requests queue rather than block.


Frequently Asked Questions

How much VRAM does Ideogram 4 need?

Ideogram 4 (9.3B DiT) needs roughly 20-22GB for the backbone in FP16. The nf4 quantized variant from HuggingFace fits in a 24GB GPU like the RTX 4090. For production batch inference, the fp8 variant on a 40-80GB GPU gives the best throughput-to-VRAM ratio. An A100 SXM4 80G handles fp8 inference at batch=4 with comfortable headroom.

Which GPU is best for running Ideogram 4?

For single-image generation with nf4, an RTX 4090 (24GB) works if available. For production batch workloads, an A100 SXM4 80G at $1.69/hr on-demand or $0.82/hr spot on Spheron is the best cost-per-image option. H100 PCIe at $2.01/hr gives higher throughput. H100 SXM5 at $3.98/hr on-demand is the right pick for highest-throughput pipelines serving interactive generation endpoints.

Can Ideogram 4 run on an RTX 4090?

Yes, with the nf4 variant (ideogram-ai/ideogram-4-nf4). The 4-bit quantized weights fit within 24GB VRAM, but batch sizes above 1 are tight with only 24GB available. For sustained batch inference or batch=4 workloads, use a 40GB or 80GB GPU instead.

Why does Ideogram 4 render text better than other open models?

Ideogram 4 uses a single-stream DiT where text tokens and image tokens attend to each other at every transformer layer. This unified attention lets the model learn tight alignment between text placement and visual layout during training. SDXL and older latent diffusion models use cross-attention that processes text separately, which leads to approximate text positioning and garbled glyphs. Ideogram 4 was explicitly trained on typography tasks, giving it consistent letter spacing and multi-word rendering that other models cannot match at the 9.3B scale.

How does self-hosting cost compare with the Ideogram API?

The Ideogram 4 Quality API charges approximately $0.10 per image. Self-hosting on Spheron A100 SXM4 spot at $0.82/hr with roughly 1,300 images per hour puts the per-image cost at about $0.0006. The API makes sense for low volumes below a few hundred images per week. Beyond that threshold, self-hosting on spot GPU instances saves substantially on per-image cost.
