NVIDIA H100 vs RTX 4090 for AI: Data Center vs Consumer GPU Compared (2026)

A team training a 7B model on an RTX 4090 hits a wall when they want to run the same job on a 70B model or scale to multi-GPU. The question becomes: rent an H100 for the training run, or buy another 4090? This post gives you the numbers to make that call, covering VRAM ceilings, inference throughput, cost per million tokens, and the training scenarios where the 4090 simply cannot compete.

Short answer: the 4090 wins for development, sub-13B inference, and cost-sensitive serving on smaller models. The H100 wins for 70B+ workloads, production serving, and anything requiring FP8 or NVLink.

Quick Answer

Scenario	Best GPU	Why
Sub-13B dev and local inference	RTX 4090	Lower $/hr, sufficient VRAM for 7B FP16 or 13B Q4
70B+ training or inference	H100 SXM5	70B Q4 needs 35 GB, exceeds 4090's 24 GB
QLoRA fine-tuning up to 20B	RTX 4090	24 GB handles most QLoRA workloads, lower hourly rate
Production multi-user serving	H100 SXM5	ECC, MIG partitioning, NVLink, FP8 throughput

Specs Side by Side

Specification	RTX 4090	H100 SXM5
Architecture	Ada Lovelace (AD102)	Hopper (GH100)
VRAM	24 GB GDDR6X	80 GB HBM3
Memory Bandwidth	1,008 GB/s	3,350 GB/s
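FP16 Tensor TFLOPS	165.2 (dense)	1,979 (with sparsity; ~989 dense)
BF16 Tensor TFLOPS	165.2 (dense)	1,979 (with sparsity; ~989 dense)
FP8 Tensor TFLOPS	~660 (dense; no Transformer Engine)	3,958 (with sparsity; ~1,979 dense)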
Tensor Core Generation	4th Gen	4th Gen
NVLink	No	Yes (900 GB/s, NVLink 4.0)
MIG Support	No	Yes (up to 7 instances)
ECC Memory	No	Yes
Transformer Engine	No	Yes
TDP	450W	700W
PCIe Generation	Gen 4 x16	Gen 5 (on PCIe variant)
Spheron On-Demand Price	From $0.55/hr	From $3.84/hr
Retail/Purchase Price	~$1,599-$2,000	~$30,000+

Pricing as of 25 May 2026. Check current GPU pricing for live rates.

One clarification on FP8: the RTX 4090 has FP8 Tensor Core support (hence its 1,321 AI TOPS figure, which is the peak with sparsity), but it does not have NVIDIA's Transformer Engine. That is a hardware-plus-software pipeline specific to Hopper that automatically manages FP8 scaling factors across attention and feed-forward layers. The H100's 3,958 TFLOPS is its peak FP8 rating with sparsity (about half that dense), and this pipeline is what lets real transformer layers run in FP8. The 4090's FP8 is raw Tensor Core arithmetic without the engine that makes it practical for LLM workloads.

Also worth noting: the RTX 4090 has no NVLink whatsoever. The RTX 3090 had an NVLink bridge option, but NVIDIA removed it entirely on the 4090. Multi-GPU on a 4090 cluster is PCIe Gen 4 only.

What Fits: VRAM Ceiling by Model

Model	4090 FP16	4090 Q4	H100 FP16	H100 Q4
Llama 3.1 8B	Yes (~16 GB)	Yes (~4 GB)	Yes	Yes
Llama 3.1 13B	No (~26 GB)	Yes (~7 GB)	Yes	Yes
Qwen3 32B	No (~64 GB)	Marginal (~20 GB weights, OOM with KV cache)	Yes	Yes
Llama 3.3 70B	No (~140 GB)	No (~35 GB)	No (~140 GB)	Yes (~35 GB)
Llama 3 405B	No	No	No (~810 GB)	No (~200 GB)

For Llama 3.3 70B at Q4_K_M: model weights need roughly 35 GB (70B * 0.5 bytes average = 35 GB). This exceeds the 4090's 24 GB completely. On the H100's 80 GB, it fits with 45 GB to spare for the KV cache. On the 4090, even two cards (48 GB combined via PCIe) can technically load it, but the PCIe bandwidth makes tensor parallelism impractical for real throughput.
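
The weight arithmetic above can be sketched in a few lines. This is a rough estimate, assuming an idealized 4 bits per weight for Q4 and 16 bits for FP16; real Q4 formats also store scales and keep some tensors at higher precision, so actual files run somewhat larger. The function names are illustrative, not from any library.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit in VRAM (KV cache and activations not counted)."""
    return weight_gb(params_billion, bits_per_weight) <= vram_gb


for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    print(
        f"{name}: FP16 ~{weight_gb(params, 16):.0f} GB, "
        f"Q4 ~{weight_gb(params, 4):.0f} GB, "
        f"Q4 fits 24 GB: {fits(params, 4, 24)}, "
        f"Q4 fits 80 GB: {fits(params, 4, 80)}"
    )
```

Treat a "fits" result as necessary rather than sufficient: the KV cache and runtime overhead still need room on top of the weights.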

For Qwen3 32B on the 4090: the Q4 weights (~20 GB) fit in 24 GB, but vLLM's default context lengths add several GB of KV cache on top, pushing past the limit. You can set --max-model-len 2048 to work around this, which is fine for short-context tasks but not practical for production serving. For the full model capacity matrix across GPUs, see GPU memory requirements for LLMs.
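
To see why context length decides whether a 32B Q4 model fits, here is a back-of-envelope KV cache estimate for a single sequence. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are assumptions about Qwen3 32B and should be checked against the model's own config; the ~20 GB weight figure comes from the table above. vLLM's actual allocation also depends on its memory settings, so this is only a sketch.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GB for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


# Assumed architecture values for a 32B-class model; verify before relying on them.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
WEIGHTS_GB = 20   # Q4 weights, from the table above
VRAM_GB = 24      # RTX 4090

for context in (2048, 8192, 32768):
    cache = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    total = WEIGHTS_GB + cache
    print(f"context {context}: KV ~{cache:.2f} GB, total ~{total:.1f} GB, "
          f"fits: {total <= VRAM_GB}")
```

Under these assumptions a 2048-token context adds roughly half a gigabyte, while a 32K context adds several gigabytes and pushes past 24 GB, which matches the behavior described above.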

Training: Where the 4090 Cannot Go

For anything that fits in 24 GB, the 4090 is a capable fine-tuning GPU. QLoRA on models up to 20B, LoRA on models up to 13B, and full fine-tuning on models up to 3B all work fine. The RTX 4090 for AI/ML full specs and benchmarks covers these in detail.

The hard limits show up in three places.

NVLink absence. PCIe Gen 4 gives 32 GB/s per direction between GPUs. The H100's NVLink 4.0 gives 900 GB/s total bidirectional bandwidth. In an 8-GPU training cluster, the all-reduce operations that synchronize gradients across GPUs are bandwidth-limited. A 4090 cluster can do data parallelism (send gradients, not activations), but tensor parallelism and pipeline parallelism require bandwidth that PCIe cannot provide at scale. The H100's NVLink fabric is what makes 8-GPU tensor parallelism practical on 70B+ models.

No FP8 Transformer Engine. For FP8 training, the H100's peak rating is 3,958 TFLOPS (with sparsity; about half that dense), and the Transformer Engine applies FP8 to attention and feed-forward layers. The 4090 does not have this pipeline, so BF16 (165.2 TFLOPS) is its practical ceiling for training. For a detailed explanation of FP8 and the Transformer Engine, see our FP8 quantization guide.

Multi-node scaling. The H100 SXM5 form factor connects to NVSwitch via NVLink and supports InfiniBand for inter-node communication. The 4090 is single-node PCIe only. If your training job needs more than one physical machine, you need data center hardware.

Training throughput comparison for representative workloads:

Workload	RTX 4090	H100 SXM5	8x H100 SXM5
Llama 3.1 8B QLoRA (tok/s)	~520	~1,800	~12,000
Llama 3.3 70B BF16 LoRA (tok/s)	Not possible	Not possible (single GPU)	~38,000
Llama 3.3 70B FP8 LoRA (tok/s)	Not possible	Not possible (single GPU)	~62,000

The 70B figures on 8x H100 are drawn from published MLPerf Training v4.0 results and community benchmarks at rank 64 LoRA, 4K sequence length, with gradient checkpointing. The single-GPU QLoRA figures are estimates based on memory bandwidth ratios, not measurements.

Inference: Where the 4090 Punches Above Its Weight

For single-user inference on 7B-13B models, the 4090's 1,008 GB/s bandwidth is genuinely competitive. Running Llama 8B Q4_K_M with llama.cpp or Ollama:

Model	RTX 4090 (single user)	H100 SXM5 (single user)
Llama 3.1 8B Q4	80-120 tok/s	350-450 tok/s
Llama 3.1 13B Q4	40-60 tok/s	180-240 tok/s
Llama 3.3 70B Q4	Not possible	90-110 tok/s

For personal/dev use on sub-13B models, 80-120 tok/s is interactive speed. Nobody needs 400 tok/s to iterate on prompts. The H100 is faster, but the 4090 at $0.55/hr is more economical for solo development.
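
A common rule of thumb explains why bandwidth dominates here: in single-user decoding, each generated token requires reading roughly all of the model weights once, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below applies that rule to the ~4 GB Q4 footprint of an 8B model. It is a ceiling, not a prediction; measured numbers such as those in the table land well below it because of kernel overhead, KV cache reads, and sampling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 4  # Llama 3.1 8B at Q4, from the VRAM table

for gpu, bandwidth in [("RTX 4090", 1008), ("H100 SXM5", 3350)]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{gpu}: ceiling ~{ceiling:.0f} tok/s")
```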

The picture changes for multi-user serving with vLLM. At high batch sizes, the H100's bandwidth advantage compounds and its higher compute TFLOPS become relevant for prefill operations:

Model	RTX 4090 (vLLM, batched)	H100 SXM5 (vLLM, batched)
Llama 3.1 8B FP16	~2,550 tok/s	~8,000 tok/s
Llama 3.1 8B FP8	N/A	~12,000 tok/s
Llama 3.3 70B Q4	N/A	~1,500 tok/s

For serving Llama 8B at scale, the cost-per-token picture depends on the hourly rate. See the Cost Per Million Tokens section below for the full math.

The 4090's bandwidth (1,008 GB/s) is genuinely strong for a $0.55/hr GPU: it is the same bandwidth as a server-grade PCIe GPU that cost $8,000 a few years ago. For small-model dev inference, it holds its own.

Multi-GPU Reality: 8x 4090 vs 1x H100 vs 8x H100

When teams outgrow a single GPU, the choice gets more interesting.

Configuration	Total VRAM	All-Reduce Bandwidth	Max Single Model	Monthly (40 hrs/wk)
8x RTX 4090 (PCIe)	192 GB	32 GB/s (PCIe Gen 4)	~70B Q4 (model parallelism required)	~$704
1x H100 SXM5	80 GB	N/A (single GPU)	~70B Q4	~$614
8x H100 SXM5	640 GB	900 GB/s (NVLink 4.0)	405B Q4	~$4,920

8x H100 SXM5 monthly estimate based on $30.75/hr cluster rate (cheapest 8-GPU offer at time of writing). 8x RTX 4090 at $0.55/hr x 8 = $4.40/hr.

Eight RTX 4090s via PCIe can technically load a 70B model across cards (via tensor parallelism or pipeline parallelism), but the 32 GB/s PCIe bottleneck makes tensor parallelism slow. What 8x 4090 does well: data-parallel inference on models that fit in a single card's 24 GB, where each card handles separate requests without cross-GPU communication. For training workloads requiring gradient synchronization at scale, PCIe bandwidth is the bottleneck and performance falls significantly below NVLink clusters.

A single H100 SXM5 covers most teams: it handles 70B Q4 inference on a single card and provides ECC memory and MIG partitioning that 8x 4090 cannot. The 8x H100 cluster is for training 70B+ models in BF16 or FP8, or serving 405B+ models in production.

Cost Per Million Tokens: The Real Math

Using live Spheron on-demand pricing as of 25 May 2026.

Formula: Cost/M tokens = hourly_rate / (tokens_per_second x 3,600) x 1,000,000

Llama 3.1 8B, batched vLLM:

Configuration	$/hr	tok/s	Cost/1M tokens
RTX 4090 on-demand	$0.55	2,550	$0.060
H100 SXM5 on-demand (FP16)	$3.84	8,000	$0.133
H100 SXM5 on-demand (FP8)	$3.84	12,000	$0.089

At current on-demand prices, the RTX 4090 is cheaper per token for Llama 8B, even against the H100 at full batch throughput. The H100 FP8 narrows the gap to $0.089/M tokens but still costs more than the 4090. The H100's value shifts to workloads the 4090 cannot handle: models above 24 GB VRAM, FP8 training with the Transformer Engine, and high-concurrency production serving where its MIG and ECC features matter.
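
The formula above translates directly into a small helper. The hourly rates and throughput figures below are the ones quoted in this post; the function name is illustrative.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


configs = [
    ("RTX 4090 on-demand", 0.55, 2550),
    ("H100 SXM5 on-demand (FP16)", 3.84, 8000),
    ("H100 SXM5 on-demand (FP8)", 3.84, 12000),
]

for name, rate, tok_s in configs:
    print(f"{name}: ${cost_per_million_tokens(rate, tok_s):.3f} per 1M tokens")
```

Swapping in your own measured throughput is the useful part: the result is only as good as the sustained tok/s figure, which depends on batch size and context length.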

Llama 3.3 70B Q4, batched vLLM (single GPU):

Configuration	$/hr	tok/s	Cost/1M tokens
H100 SXM5 on-demand	$3.84	1,500	$0.711
RTX 4090	N/A	N/A	Not possible

Owned RTX 4090 depreciation math:

Hardware: $2,000 system (GPU + workstation) over 3 years = $0.076/hr in depreciation
Electricity: $0.15/kWh x 450W = $0.068/hr at full load (assumes $0.15/kWh, adjust for your region)
Total effective cost at 100% utilization: ~$0.144/hr

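At 2,550 tok/s and $0.144/hr, that is $0.016/M tokens, so an owned 4090 beats every rental option at 100% utilization. At 50% utilization (more realistic), the hardware depreciation is spread over half as many productive hours and doubles to ~$0.152/hr, while electricity scales with use and stays near $0.068 per active hour (ignoring idle draw). That gives ~$0.22/hr, or about $0.024/M tokens. Even if you conservatively double the whole hourly figure to ~$0.288/hr, the result is $0.031/M tokens, which is still competitive. The break-even point against H100 rental depends on your actual utilization and the specific task.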
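
The ownership math can be parameterized so you can plug in your own hardware price, power cost, and utilization. This sketch makes one explicit assumption: depreciation accrues around the clock and is spread over active hours, while electricity is charged only for active hours (idle draw is ignored). Under that assumption the 50% figure comes out lower than simply doubling the full hourly cost. All names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24


def owned_hourly_cost(hardware_usd: float, lifetime_years: float,
                      power_watts: float, usd_per_kwh: float,
                      utilization: float = 1.0) -> float:
    """Effective USD cost per active hour of an owned GPU system."""
    depreciation = hardware_usd / (lifetime_years * HOURS_PER_YEAR)
    electricity = power_watts / 1000 * usd_per_kwh
    # Depreciation is fixed, so it grows per active hour as utilization drops.
    return depreciation / utilization + electricity


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    return hourly_rate_usd / (tokens_per_second * 3600) * 1_000_000


for utilization in (1.0, 0.5, 0.25):
    hourly = owned_hourly_cost(2000, 3, 450, 0.15, utilization)
    per_m = cost_per_million_tokens(hourly, 2550)
    print(f"{utilization:.0%} utilization: ${hourly:.3f}/hr, ${per_m:.3f} per 1M tokens")
```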

Pricing fluctuates based on GPU availability. The prices above are based on 25 May 2026 and may have changed. Check current GPU pricing for live rates.

For a full cost-per-token comparison across more GPU models and quantization levels, see GPU cost-per-token benchmarks for LLM inference 2026.

When the 4090 Wins

Solo experimentation and prototyping on sub-13B models
Cost-sensitive inference on sub-13B models: about $0.060/M tokens when batched with vLLM, the cheapest on-demand rental option in this comparison
QLoRA fine-tuning up to 20B parameters
Hobbyist and researcher workloads where the priority is low hourly cost, not maximum throughput
Always-on local development environments where owned hardware makes sense at high daily utilization

When the H100 Wins

Training runs on 30B+ models, especially with FP8 via the Transformer Engine
70B+ inference in production: the H100 is the smallest single GPU that comfortably runs Llama 3.3 70B at Q4
Production multi-user serving with ECC memory and MIG partitioning for multi-tenancy
Multi-GPU clusters where NVLink bandwidth enables tensor parallelism
Short-burst training runs: renting H100 on-demand only when you need it means no depreciation, no electricity costs, and no hardware commitment

For a detailed breakdown of the H100 SXM5 architecture including all precision tiers, see the H100 complete datasheet. For how the H100 compares to its predecessor, see the A100 vs H100 comparison.

Hybrid Play: 4090 for Dev, Spheron H100 for Training

The most cost-efficient pattern for small teams: keep a 4090 for daily development and fast iteration on sub-13B models, then rent H100 SXM5 on Spheron only for full fine-tuning runs.

Example monthly cost breakdown for a 3-person AI team:

Scenario	Monthly Compute Cost
Rent RTX 4090 on Spheron for 8 hrs/day dev (20 days)	$0.55 x 160 hrs = $88
Rent H100 SXM5 on Spheron for 20 hrs of training runs	$3.84 x 20 hrs = $76.80
Total	~$165/mo

Compare that to renting an H100 full-time for all work (including idle dev hours):

$3.84 x 24 hrs x 30 days = $2,764.80/mo

The hybrid approach cuts cost by 94% while still giving you H100 capability when the run actually needs it. For the fine-tuning setup itself, see our complete LLM fine-tuning guide.
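
The comparison is simple enough to recompute with your own hours. The rates and hour counts below are the ones used in this section; variable names are illustrative.

```python
RTX_4090_RATE = 0.55  # USD/hr, on-demand
H100_RATE = 3.84      # USD/hr, on-demand

dev_hours = 8 * 20    # 8 hrs/day over 20 working days
training_hours = 20   # H100 time for training runs

hybrid = RTX_4090_RATE * dev_hours + H100_RATE * training_hours
full_time_h100 = H100_RATE * 24 * 30
savings = 1 - hybrid / full_time_h100

print(f"Hybrid:         ${hybrid:,.2f}/mo")
print(f"Full-time H100: ${full_time_h100:,.2f}/mo")
print(f"Savings:        {savings:.0%}")
```

The savings shrink as training hours grow, so it is worth rerunning this with a realistic estimate of monthly H100 time before committing to either approach.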

You can rent H100 on Spheron on-demand with per-minute billing, meaning you only pay for the minutes the training job actually runs. Current H100 pricing applies.

Training Runs: A Worked Example

A typical LoRA fine-tuning run on Llama 3.1 70B takes approximately 8-12 hours on 8x H100 SXM5, depending on dataset size and sequence length. At the cluster rate of $30.75/hr, a 10-hour run costs $307.50. That is less than the purchase cost of a single used RTX 4090.

The same run does not translate to a 4090 cluster. 70B LoRA in BF16 needs an NVLink-connected setup such as 8x H100 for reasonable throughput. An 8x 4090 PCIe cluster can run QLoRA on a smaller model, but it cannot handle 70B BF16 at useful speeds because of PCIe bandwidth, so the two workloads are not directly comparable.

For inference-only use on 70B models, the H100 SXM5 at $3.84/hr is the single-GPU option. An 8x 4090 cluster at $4.40/hr can technically serve 70B Q4 across cards via tensor or pipeline parallelism (model sharding across cards), but the 32 GB/s PCIe bottleneck makes this impractical for production throughput compared to a single H100's 80 GB on-card memory.

Related Comparisons

If you are deciding between generations of consumer GPUs, you can rent the RTX 5090 on Spheron or see the RTX 5090 vs RTX 4090 AI benchmark for a side-by-side with Blackwell's consumer GPU, including Qwen3 32B throughput and QLoRA fine-tuning up to 30B. If you are on the fence between the H100 and its predecessor, the A100 vs H100 breakdown covers the cost-per-token differences in more detail.

For the RTX 4090 on Spheron, see the RTX 4090 rental page.

Teams that develop on a 4090 and train on H100 on-demand get the best of both worlds: low daily dev costs and full data center capability when they need it.

Frequently Asked Questions

Can the RTX 4090 replace an H100 for AI?

For models up to 7B in FP16, or up to 13B in Q4, the RTX 4090 is a reasonable substitute for development and single-user inference. For 70B+ models, multi-GPU tensor parallelism, production serving with SLAs, or FP8 training with the Transformer Engine, the RTX 4090 cannot replace the H100. The 24 GB VRAM ceiling and the absence of NVLink and ECC memory are hard limits, not configuration problems.

What is the largest model I can run on the RTX 4090 vs H100?

On the RTX 4090 (24 GB): Llama 3.1 13B in Q4 fits comfortably. Qwen3 32B at Q4/AWQ fits the weights (~20 GB) but runs out of memory for KV cache at default context lengths in vLLM. Llama 3.3 70B in any quantization does not fit. On the H100 SXM5 (80 GB): Llama 3.3 70B at Q4 fits with room for KV cache (~35 GB model weights). Llama 3 405B requires multiple GPUs even on the H100.

Is renting an H100 cheaper than buying an RTX 4090 for AI?

It depends on utilization. An owned RTX 4090 workstation ($2,000 hardware, 3-year life) costs roughly $0.14/hr in amortized hardware and electricity at full utilization. At 2,550 tok/s FP16 (batched vLLM), that comes to about $0.016/M tokens, far cheaper than any rental. But most teams are not at 100% GPU utilization. For burst workloads, training runs, or multi-GPU needs, renting H100 on-demand at $3.84/hr only when you need it often costs less than buying and underutilizing hardware.

Does the RTX 4090 have the FP8 Transformer Engine?

No. The RTX 4090 has fourth-generation Tensor Cores that support FP8 arithmetic, which is reflected in its 1,321 AI TOPS figure. But it does not have the Transformer Engine, which is a hardware-plus-software pipeline specific to Hopper architecture GPUs (H100 and H200). The Transformer Engine automatically manages FP8 scaling factors across attention and feed-forward layers, which is what lets the H100 apply its FP8 Tensor Cores (rated at a peak of 3,958 TFLOPS with sparsity) to actual transformer workloads. The 4090 cannot use this pipeline regardless of software version.

What is the best GPU workflow for a small AI team on a budget?

Develop on a 4090 (owned or rented at $0.55/hr) for day-to-day work on sub-13B models, prompt testing, and fast iteration. Rent H100 SXM5 on Spheron on-demand only for full fine-tuning runs that need the extra VRAM, FP8 throughput, or multi-GPU scaling. This hybrid approach avoids paying H100 rates during idle hours while giving you data center capability when the run actually needs it.