At $0.76/hr on Spheron, the RTX 5090 delivers approximately 3,500 tokens/sec on Llama 3.1 8B in FP16, roughly $0.060 per million tokens. For teams running inference on models up to 30B parameters, that cost-per-token figure beats the RTX 4090, A100 80GB, and H100 PCIe on Spheron. The Blackwell bandwidth advantage is real, and the pricing makes it accessible.
The RTX 5090 is NVIDIA's flagship consumer Blackwell GPU, launched in January 2025. Its 32GB of GDDR7 memory and 5th-generation Tensor Cores with native FP4 support make it a serious option for inference, fine-tuning, and diffusion workloads at consumer-card pricing rather than data center accelerator pricing. The 1,792 GB/s memory bandwidth is 78% higher than the RTX 4090's and approaches the H100 PCIe's 2,000 GB/s, which is the number that matters most for LLM inference throughput.
Here's exactly what you can run on it, how fast, and what it costs on Spheron.
RTX 5090 Specs: What Matters for AI Workloads
Don't just read the spec sheet; here's what each number actually means for your workloads.
| Spec | RTX 5090 | RTX 4090 | H100 PCIe |
|---|---|---|---|
| Architecture | Blackwell | Ada Lovelace | Hopper |
| VRAM | 32GB GDDR7 | 24GB GDDR6X | 80GB HBM2e |
| Memory Bandwidth | 1,792 GB/s | 1,008 GB/s | 2,000 GB/s |
| CUDA Cores | 21,760 | 16,384 | 14,592 |
| Tensor Cores | 680 (5th Gen) | 512 (4th Gen) | 456 (4th Gen) |
| FP16 Tensor (TFLOPS) | 838.4 (sparse) | 330.3 (sparse) | 1,513 (sparse) |
| AI TOPS | 3,352 (FP4) | 1,321 (INT8) | 3,026 (INT8) |
| FP8 support | Yes | Yes | Yes |
| FP4 support | Yes (Blackwell native) | No | No |
| TDP | 575W | 450W | 350W |
| Form factor | PCIe | PCIe | PCIe / SXM |
| NVLink | No | No | No (PCIe) / Yes (SXM) |
Note: AI TOPS figures reflect NVIDIA's official published values at each GPU's highest supported precision. The RTX 5090's 3,352 TOPS uses FP4 with sparsity (Blackwell native, not available on Ada Lovelace or Hopper). At INT8 with sparsity, the RTX 5090 delivers approximately 1,677 TOPS. RTX 4090 and H100 PCIe figures are INT8 with sparsity.
What the specs mean in practice:
32GB GDDR7 fits most models up to 30B parameters comfortably. The critical upgrade from the RTX 4090 is not just capacity (+8GB) but bandwidth: GDDR7 delivers 1,792 GB/s, 78% more than the 4090's GDDR6X. Since LLM autoregressive decoding is almost entirely memory-bandwidth-bound, this directly improves inference throughput.
Blackwell FP4 is native to the Blackwell GPU generation, including the RTX 5090, and is not available on any prior consumer GPU architecture. For models that support FP4 quantization, throughput roughly doubles compared to FP16. This matters for inference deployments that can tolerate the quantization trade-off, and it's a feature no Ada Lovelace or earlier consumer GPU supports.
No NVLink is the key limitation. Unlike H100 SXM, the RTX 5090 uses PCIe for multi-GPU communication. Running parallel independent model replicas works fine. Tensor parallelism across GPUs (splitting a single large model's layers) does not. Plan accordingly.
96MB L2 cache (up from 72MB on the RTX 4090) reduces memory bandwidth pressure for workloads with repetitive access patterns, including transformer attention. This partially explains why throughput gains outpace raw bandwidth ratios on some workloads.
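Because decode is bandwidth-bound, you can sketch a per-stream throughput ceiling directly from the spec table: each generated token requires streaming roughly all model weights from VRAM once, so tokens/sec per stream cannot exceed bandwidth divided by model size. A minimal sketch of that rule of thumb (the ceiling is an upper bound; real numbers land below it, and batching multiplies aggregate throughput by reusing each weight read across requests):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound LLM.

    Each generated token reads (roughly) every weight from VRAM once,
    so tokens/sec per stream <= bandwidth / model size in bytes.
    """
    return bandwidth_gb_s / model_gb

# Llama 3.1 8B in FP16 is ~16 GB of weights.
print(decode_ceiling_tok_s(1792, 16))  # RTX 5090: 112.0 tok/s per stream
print(decode_ceiling_tok_s(1008, 16))  # RTX 4090: 63.0 tok/s per stream
```

The ratio between the two ceilings is exactly the 78% bandwidth gap, which is why the bandwidth row in the table is the one to read first.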
What Models Fit in 32GB VRAM
| Model | Size | Precision | Fits on RTX 5090? | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | FP16 | ✅ Yes | ~16GB, plenty of headroom |
| Mistral 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Qwen 2.5 14B | 14B | FP16 | ⚠️ Marginal | ~28GB weights; full footprint with KV cache ~34GB, requires --max-model-len limits |
| Qwen 2.5 32B | 32B | Q4 | ✅ Yes | ~20GB with Q4 quantization |
| Llama 4 Scout | 109B MoE | INT4 | ❌ No | ~55GB at INT4, exceeds 32GB |
| DeepSeek R1 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Llama 3.3 70B | 70B | FP16 | ❌ No | Needs ~140GB |
| Llama 3.3 70B | 70B | Q4 | ❌ No | ~35-40GB, exceeds 32GB limit |
| Flux.1 Dev | - | BF16 | ⚠️ Marginal | ~30-33GB with T5 XXL + VAE; use FP8 for reliable operation |
| SDXL | - | FP16 | ✅ Yes | ~8-10GB, comfortable |
For a complete model-to-VRAM reference across the full current model landscape, see our GPU requirements cheat sheet.
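The fit column in the table above follows from simple arithmetic: parameter count times bytes per parameter, plus some framework overhead. A rough estimator, assuming ~10% overhead for CUDA context and framework buffers and ~0.55 bytes/param for Q4 (4-bit weights plus per-group scales); KV cache is extra and grows with context length and batch size:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "q4": 0.55}

def weights_gb(params_b: float, precision: str, overhead: float = 1.1) -> float:
    """Rough VRAM footprint of model weights alone, in GB.

    params_b is parameter count in billions; overhead=1.1 adds ~10%
    for CUDA context and framework buffers. KV cache is not included.
    """
    return params_b * BYTES_PER_PARAM[precision] * overhead

print(round(weights_gb(8, "fp16"), 1))   # Llama 3.1 8B -> ~17.6 GB, fits
print(round(weights_gb(32, "q4"), 1))    # Qwen 2.5 32B Q4 -> ~19.4 GB, fits
print(round(weights_gb(70, "fp16"), 1))  # Llama 3.3 70B -> ~154 GB, no fit
```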
Real Throughput Benchmarks
Throughput numbers below are based on vLLM benchmarks on the RTX 5090 and extrapolated from memory bandwidth ratios for model variants not directly measured. All figures represent aggregate throughput across concurrent requests at the listed batch size (not per-request latency). Actual throughput varies with prompt length, context length, driver version, and vLLM version.
| Model | Framework | Batch Size | Tokens/sec | TTFT (ms) | VRAM Used | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | vLLM | 32 | ~3,500 | ~45 | ~16GB | |
| Mistral 7B | vLLM | 32 | ~4,100 | ~40 | ~14GB | |
| Qwen 2.5 14B | vLLM | 16 | ~1,900 | ~80 | ~28GB | Requires --max-model-len (e.g. --max-model-len 2048) to cap KV cache; full footprint with KV cache ~34GB exceeds 32GB and causes OOM without this flag |
| Qwen 2.5 32B (Q4) | vLLM | 8 | ~1,600 | ~65 | ~20GB | |
| Flux.1 Dev (FP8) | ComfyUI | 1 | ~23 img/min | - | ~18GB | 20 diffusion sampling steps, FP8 quantization |
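The Qwen 2.5 14B OOM note in the table comes down to KV cache arithmetic: every layer stores an FP16 K and V tensor per token of context, so total context tokens (sequence length times batch size) determine cache size. A sketch, using layer and head counts assumed from the published Qwen2.5-14B configuration (48 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    """FP16 KV cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch / 1e9

# ~192 KB per token of context with this config, so it adds up fast:
print(kv_cache_gb(48, 8, 128, seq_len=32768, batch=1))  # ~6.4 GB for one 32k request
print(kv_cache_gb(48, 8, 128, seq_len=2048, batch=16))  # same ~6.4 GB for 16 x 2k
```

With ~28GB of FP16 weights already resident, even a few GB of KV cache blows past 32GB, which is why --max-model-len is mandatory for that row.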
The most striking data point: for Llama 3.1 8B, the RTX 5090 delivers approximately 46% more throughput than the RTX 4090 (which benchmarks at ~2,400 tokens/sec for 7B-class models) at only 31% higher hourly cost on Spheron. The Blackwell memory bandwidth improvement is real and directly translates to inference throughput for memory-bound workloads.
For Stable Diffusion XL, the RTX 5090 achieves 38 img/min at 1024x1024 in FP16 with 20 steps (Euler sampler), compared to 28 img/min on the RTX 4090 under identical settings, a 36% improvement driven by Blackwell's higher memory bandwidth and compute throughput. The gain is smaller than the raw bandwidth ratio (78%) because SDXL performance is partially compute-bound at these settings.
Cost Per Million Tokens: The Number That Actually Matters
This is what developers should bookmark. The formula:
Cost per 1M tokens = ($/hr) / (tokens/sec × 3600) × 1,000,000

| GPU | $/hr on Spheron | Tokens/sec (Llama 3.1 8B) | Cost per 1M tokens |
|---|---|---|---|
| RTX 5090 | $0.76 | ~3,500 | ~$0.060 |
| RTX 4090 | $0.58 | ~2,400 | ~$0.067 |
| A100 (80GB) | $1.07 | ~2,800 | ~$0.106 |
| H100 PCIe | $2.01 | ~5,000 | ~$0.112 |
Worked example with RTX 5090 at $0.76/hr and 3,500 tokens/sec:
Cost per 1M tokens = $0.76 / (3,500 × 3,600) × 1,000,000
                   = $0.76 / 12,600,000 × 1,000,000
                   ≈ $0.060 per million tokens

The H100 PCIe generates faster raw throughput (~5,000 tok/s) but costs $2.01/hr, nearly 2.6x the RTX 5090's price. For small models that fit in 32GB, the H100's larger memory (80GB) provides no advantage. The result: the RTX 5090 delivers better cost-per-token economics for any model under 30B parameters.
To calculate your own cost: find your model's tokens/sec on the RTX 5090, plug in $0.76/hr, and use the formula above. The same math scales to any Spheron pricing tier.
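The formula in code form, reproducing the table rows above so you can plug in your own throughput numbers:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per 1M generated tokens at a given hourly rate and throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1_000_000

print(round(cost_per_million_tokens(0.76, 3500), 3))  # RTX 5090  -> 0.06
print(round(cost_per_million_tokens(0.58, 2400), 3))  # RTX 4090  -> 0.067
print(round(cost_per_million_tokens(2.01, 5000), 3))  # H100 PCIe -> 0.112
```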
Best Use Cases for RTX 5090 on Spheron
Models up to 30B parameters
If your model fits in 32GB with your preferred quantization, the RTX 5090 gives you Blackwell performance at consumer GPU pricing. Check the VRAM table above. The 78% bandwidth advantage over the RTX 4090 directly improves your throughput-per-dollar for inference on these model sizes.
QLoRA fine-tuning on 7B-30B models
The RTX 5090's 32GB of GDDR7 and Blackwell Tensor Core architecture make it strong for adapter-based fine-tuning. Fine-tune Mistral 7B with QLoRA INT4 at approximately 720 tokens/sec training throughput, a workload that previously required stepping up to A100-class hardware or accepting slower speeds on the RTX 4090. See our LLM fine-tuning guide for setup instructions.
ComfyUI, Stable Diffusion, Flux
GDDR7 bandwidth matters for diffusion models. The RTX 5090 achieves 38 img/min on SDXL (20 steps, 1024x1024) and approximately 23 img/min on Flux.1 Dev FP8 (20 steps, 1024x1024). For teams running batch image generation pipelines, the RTX 5090 is one of the best-value options in the cloud. See our ComfyUI on GPU cloud guide for a full benchmark comparison, Docker setup instructions, and cost-per-image analysis.
Development and testing before scaling
Start development and evaluation on RTX 5090 at $0.76/hr. When production requirements exceed 32GB (serving a 70B model in FP16, for instance), Spheron lets you switch to H100 or H200 without changing platforms or re-engineering your stack.
High-volume small model APIs
If you're serving millions of Llama 3.1 8B or Mistral 7B queries per day, cost-per-token dominates total spend. At $0.060/M tokens, the RTX 5090 on Spheron delivers the lowest cost-per-token among commonly available GPU options for sub-30B inference. See our GPU cost optimization playbook for production cost analysis.
When NOT to Use RTX 5090
Be honest about the limitations:
70B models in FP16: Llama 3.3 70B in FP16 requires ~140GB VRAM, more than 4x what the RTX 5090 offers. Even Q4 quantization pushes ~35-40GB, exceeding the 32GB limit. For 70B models, the H100 (80GB) or H200 (141GB) are the correct choices.
Multi-GPU tensor parallelism: No NVLink means all GPU-to-GPU communication runs over PCIe. For models split across multiple GPUs using tensor parallelism, this becomes a bottleneck. PCIe is adequate for data parallelism (running independent replicas) but not tight model parallelism.
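Data parallelism without NVLink is straightforward in practice: each GPU runs its own full replica, and a client-side dispatcher spreads requests across them with no inter-GPU traffic at all. A minimal round-robin sketch (the endpoint URLs are placeholders for wherever your replicas listen):

```python
import itertools

class ReplicaPool:
    """Round-robin across independent model replicas (data parallelism).

    Each replica holds a full copy of the model on its own GPU, so no
    GPU-to-GPU communication is needed -- the pattern PCIe-only boxes
    handle fine, unlike tensor parallelism.
    """
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self) -> str:
        return next(self._cycle)

# Placeholder replica addresses for illustration:
pool = ReplicaPool([f"http://10.0.0.{i}:8000/v1" for i in (1, 2, 3)])
print([pool.next_endpoint() for _ in range(4)])  # cycles back to the first
```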
Enterprise production SLAs: The RTX 5090 is a consumer GPU. It lacks ECC memory, MIG multi-instance GPU partitioning, and the reliability certification that data center GPUs carry. For production deployments with formal SLA requirements, H100 is the proven choice.
Very large batch inference: At large batch sizes, the H100's HBM advantage in raw bandwidth becomes significant. The RTX 5090's GDDR7 is fast, but HBM is a different class of memory with higher peak bandwidth for sustained large-batch workloads.
How to Deploy on Spheron: Quick Start
- Go to app.spheron.ai and create an account with GitHub or Gmail
- Add credits via the credit button in the top-right corner (card or crypto)
- Click Deploy in the left menu to open the GPU catalog
- Select RTX 5090, configure vCPUs, RAM, storage, and region
- Add your SSH key and click Deploy Instance; your GPU VM is ready in under a minute
SSH into your instance:
```shell
ssh -i <private-key-path> user@<your-vm-ip>
```

Launch a vLLM inference server with Llama 3.1 8B:

```shell
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

For Flux.1 Dev in ComfyUI, use the standard ComfyUI Docker image; the RTX 5090's 32GB VRAM handles Flux.1 Dev (~24GB model weights at BF16) with room to spare for workflow overhead.
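Once the vLLM server is up, any OpenAI-compatible client can talk to it. A minimal sketch that builds a request for the /v1/chat/completions endpoint (the function name and the localhost base URL are illustrative; the model name matches the --model flag above):

```python
import json
import urllib.request

def chat_request(base_url: str, prompt: str,
                 model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Build (url, body) for vLLM's OpenAI-compatible chat endpoint.

    Returning the pair lets you inspect the payload or send it with
    any HTTP client; base_url is your instance, e.g. http://<your-vm-ip>:8000.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return f"{base_url}/v1/chat/completions", body

url, body = chat_request("http://localhost:8000", "Say hello in one word.")
print(url)
# To actually send it (requires the running server from the docker command above):
# req = urllib.request.Request(url, json.dumps(body).encode(),
#                              {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```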
For detailed setup guides, see the Spheron documentation.
Pricing
Current RTX 5090 pricing on Spheron:
| Configuration | Price/hr |
|---|---|
| RTX 5090 (PCIe) | $0.76/hr |
| RTX 5090 bare metal (8x GPU) | $7.39/hr |
Note: Prices above are sourced from Spheron's live marketplace as of March 10, 2026. GPU cloud pricing fluctuates with supply and demand; rates may be lower or higher when you read this. Spheron aggregates supply from multiple providers, so pricing and availability vary by region. Check live pricing for current rates.
Comparison with other providers and GPU options:
| Option | Price/hr |
|---|---|
| RTX 5090 on Spheron | $0.76/hr |
| RTX 4090 on Spheron | $0.58/hr |
| H100 PCIe on Spheron | $2.01/hr |
| RTX 5090 on RunPod | ~$1.24/hr |
| RTX 5090 on Lambda Labs | ~$1.59/hr |
Note: Competitor pricing is approximate and changes frequently. Verify current rates directly at each provider's pricing page before making purchasing decisions.
For Llama 3.1 8B inference, the RTX 5090 on Spheron provides the lowest cost-per-token of any option listed. It beats even the RTX 4090, despite the 4090's lower hourly rate, because the 5090's 46% throughput advantage more than offsets its 31% price premium.
RTX 5090 instances on Spheron are available now, no quota, no waitlist. Deploy in minutes.
