Rent NVIDIA RTX 5090: Real-World LLM Throughput vs Cost Analysis

Written by Mitrasish, Co-founder · Mar 10, 2026

Tags: GPU Cloud, NVIDIA RTX 5090, GPU Rental, LLM Inference, Blackwell, AI Infrastructure, Cost Optimization
At $0.76/hr on Spheron, the RTX 5090 delivers approximately 3,500 tokens/sec on Llama 3.1 8B in FP16, roughly $0.060 per million tokens. For teams running inference on models up to 30B parameters, that cost-per-token figure beats the RTX 4090, A100 80GB, and H100 PCIe on Spheron. The Blackwell bandwidth advantage is real, and the pricing makes it accessible.

The RTX 5090 is NVIDIA's flagship consumer Blackwell GPU, launched in January 2025. Its 32GB of GDDR7 memory and 5th-generation Tensor Cores with native FP4 support make it a serious option for inference, fine-tuning, and diffusion workloads, at a price point far closer to a consumer card than a data center accelerator. The 1,792 GB/s memory bandwidth is 78% higher than the RTX 4090 and approaches the H100 PCIe's 2,000 GB/s, which is the number that matters most for LLM inference throughput.

Here's exactly what you can run on it, how fast, and what it costs on Spheron.

RTX 5090 Specs: What Matters for AI Workloads

Don't just read the spec sheet; here's what each number actually means for your workloads.

| Spec | RTX 5090 | RTX 4090 | H100 PCIe |
|---|---|---|---|
| Architecture | Blackwell | Ada Lovelace | Hopper |
| VRAM | 32GB GDDR7 | 24GB GDDR6X | 80GB HBM2e |
| Memory Bandwidth | 1,792 GB/s | 1,008 GB/s | 2,000 GB/s |
| CUDA Cores | 21,760 | 16,384 | 14,592 |
| Tensor Cores | 680 (5th Gen) | 512 (4th Gen) | 456 (4th Gen) |
| FP16 Tensor (TFLOPS) | 838.4 (sparse) | 330.3 (sparse) | 1,513 (sparse) |
| AI TOPS | 3,352 (FP4) | 1,321 (INT8) | 3,026 (INT8) |
| FP8 support | Yes | Yes | Yes |
| FP4 support | Yes (Blackwell native) | No | No |
| TDP | 575W | 450W | 350W |
| Form factor | PCIe | PCIe | PCIe / SXM |
| NVLink | No | No | No (PCIe) / Yes (SXM) |

Note: AI TOPS figures reflect NVIDIA's official published values at each GPU's highest supported precision. The RTX 5090's 3,352 TOPS uses FP4 with sparsity (Blackwell native, not available on Ada Lovelace or Hopper). At INT8 with sparsity, the RTX 5090 delivers approximately 1,677 TOPS. RTX 4090 and H100 PCIe figures are INT8 with sparsity.

What the specs mean in practice:

32GB GDDR7 fits most models up to 30B parameters comfortably. The critical upgrade from the RTX 4090 is not just capacity (+8GB) but bandwidth: GDDR7 delivers 1,792 GB/s, 78% more than the 4090's GDDR6X. Since LLM autoregressive decoding is almost entirely memory-bandwidth-bound, this directly improves inference throughput.
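The bandwidth-bound claim is easy to sanity-check with back-of-envelope arithmetic. This sketch (our own illustration, not a Spheron tool) divides memory bandwidth by the bytes streamed per decode step:

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound LLM.
# Assumes each generated token streams the full weight set once per forward
# pass; ignores KV-cache reads, kernel overhead, and compute limits.

def decode_tokens_per_sec(bandwidth_gb_s: float, model_params_b: float,
                          bytes_per_param: float, batch_size: int) -> float:
    weight_gb = model_params_b * bytes_per_param   # resident weight bytes
    per_request = bandwidth_gb_s / weight_gb       # tokens/sec at batch=1
    return per_request * batch_size                # batching amortizes the weight read

# Llama 3.1 8B in FP16 (16GB of weights) on the RTX 5090 (1,792 GB/s), batch 32:
print(round(decode_tokens_per_sec(1792, 8, 2, 32)))  # 3584
```

The estimate lands within a few percent of the measured ~3,500 tokens/sec benchmark figure, which is why bandwidth, not TFLOPS, is the headline spec for inference.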

Blackwell FP4 is native to the Blackwell GPU generation, including the RTX 5090, and is not available on any prior consumer GPU architecture. For models that support FP4 quantization, throughput roughly doubles compared to FP16. This matters for inference deployments that can tolerate the quantization trade-off, and it's a feature no Ada Lovelace or earlier consumer GPU supports.

No NVLink is the key limitation. Unlike H100 SXM, the RTX 5090 uses PCIe for multi-GPU communication. Running parallel independent model replicas works fine. Tensor parallelism across GPUs (splitting a single large model's layers) does not. Plan accordingly.

96MB L2 cache (up from 72MB on the RTX 4090) reduces memory bandwidth pressure for workloads with repetitive access patterns, including transformer attention. This partially explains why throughput gains outpace raw bandwidth ratios on some workloads.

What Models Fit in 32GB VRAM

| Model | Size | Precision | Fits on RTX 5090? | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | FP16 | ✅ Yes | ~16GB, plenty of headroom |
| Mistral 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Qwen 2.5 14B | 14B | FP16 | ⚠️ Marginal | ~28GB weights; full footprint with KV cache ~34GB, requires --max-model-len limits |
| Qwen 2.5 32B | 32B | Q4 | ✅ Yes | ~20GB with Q4 quantization |
| Llama 4 Scout | 109B MoE | INT4 | ❌ No | ~55GB at INT4, exceeds 32GB |
| DeepSeek R1 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Llama 3.3 70B | 70B | FP16 | ❌ No | Needs ~140GB |
| Llama 3.3 70B | 70B | Q4 | ❌ No | ~35-40GB, exceeds 32GB limit |
| Flux.1 Dev | - | BF16 | ⚠️ Marginal | ~30-33GB with T5 XXL + VAE; use FP8 for reliable operation |
| SDXL | - | FP16 | ✅ Yes | ~8-10GB, comfortable |

For a complete model-to-VRAM reference across the full current model landscape, see our GPU requirements cheat sheet.
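To estimate fit for a model not in the table, the two dominant terms are weights and KV cache. A rough sizing sketch (our own illustration; real usage adds activations, CUDA context, and framework overhead of roughly 1-2GB):

```python
# Rough VRAM estimate: weights + KV cache. Illustrative only.

def vram_gb(params_b, bytes_per_param, n_layers, n_kv_heads, head_dim,
            context_len, batch_size, kv_bytes=2):
    weights = params_b * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per token, per sequence
    kv = (2 * n_layers * n_kv_heads * head_dim * kv_bytes
          * context_len * batch_size) / 1e9
    return weights, kv

# Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128) in FP16,
# 2K context, batch 32:
w, kv = vram_gb(8, 2, 32, 8, 128, 2048, 32)
print(f"weights ~{w:.0f}GB, KV cache ~{kv:.1f}GB")  # weights ~16GB, KV cache ~8.6GB
```

Note how fast the KV term grows with context length and batch size: the same model at 8K context and batch 32 would want ~34GB of KV cache alone, which is why serving frameworks cap context or batch on 32GB cards.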

Real Throughput Benchmarks

Throughput numbers below are based on vLLM benchmarks on the RTX 5090 and extrapolated from memory bandwidth ratios for model variants not directly measured. All figures represent aggregate throughput across concurrent requests at the listed batch size (not per-request latency). Actual throughput varies with prompt length, context length, driver version, and vLLM version.

| Model | Framework | Batch Size | Tokens/sec | TTFT (ms) | VRAM Used | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | vLLM | 32 | ~3,500 | ~45 | ~16GB | |
| Mistral 7B | vLLM | 32 | ~4,100 | ~40 | ~14GB | |
| Qwen 2.5 14B | vLLM | 16 | ~1,900 | ~80 | ~28GB | Requires --max-model-len (e.g. --max-model-len 2048) to cap KV cache; full footprint with KV cache ~34GB exceeds 32GB and causes OOM without this flag |
| Qwen 2.5 32B (Q4) | vLLM | 8 | ~1,600 | ~65 | ~20GB | |
| Flux.1 Dev (FP8) | ComfyUI | 1 | ~23 img/min | - | ~18GB | 20 diffusion sampling steps, FP8 quantization |

The most striking data point: for Llama 3.1 8B, the RTX 5090 delivers approximately 46% more throughput than the RTX 4090 (which benchmarks at ~2,400 tokens/sec for 7B-class models) at only 31% higher hourly cost on Spheron. The Blackwell memory bandwidth improvement is real and directly translates to inference throughput for memory-bound workloads.

For Stable Diffusion XL, the RTX 5090 achieves 38 img/min at 1024x1024 in FP16 with 20 steps (Euler sampler), compared to 28 img/min on the RTX 4090 under identical settings, a 36% improvement driven by Blackwell's higher memory bandwidth and compute throughput. The gain is smaller than the raw bandwidth ratio (78%) because SDXL performance is partially compute-bound at these settings.

Cost Per Million Tokens: The Number That Actually Matters

This is what developers should bookmark. The formula:

Cost per 1M tokens = ($/hr) / (tokens/sec × 3600) × 1,000,000

| GPU | $/hr on Spheron | Tokens/sec (Llama 3.1 8B) | Cost per 1M tokens |
|---|---|---|---|
| RTX 5090 | $0.76 | ~3,500 | ~$0.060 |
| RTX 4090 | $0.58 | ~2,400 | ~$0.067 |
| A100 (80GB) | $1.07 | ~2,800 | ~$0.106 |
| H100 PCIe | $2.01 | ~5,000 | ~$0.112 |

Worked example with RTX 5090 at $0.76/hr and 3,500 tokens/sec:

Cost per 1M tokens = $0.76 / (3,500 × 3,600) × 1,000,000
                   = $0.76 / 12,600,000 × 1,000,000
                   = $0.060 per million tokens

The H100 PCIe generates faster raw throughput (~5,000 tok/s) but costs $2.01/hr, nearly 2.6x the RTX 5090's price. For small models that fit in 32GB, the H100's larger memory (80GB) provides no advantage. The result: RTX 5090 delivers better cost-per-token economics for any model under 30B parameters.

To calculate your own cost: find your model's tokens/sec on the RTX 5090, plug in $0.76/hr, and use the formula above. The same math scales to any Spheron pricing tier.
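The formula above is trivial to script if you want to compare several GPUs at once; a minimal sketch:

```python
# Cost per million generated tokens from hourly price and sustained throughput.

def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

print(round(cost_per_million_tokens(0.76, 3500), 3))  # RTX 5090 -> 0.06
print(round(cost_per_million_tokens(2.01, 5000), 3))  # H100 PCIe -> 0.112
```

Swap in your own measured tokens/sec; the benchmark figures above assume sustained batch throughput, so spot-check against your actual workload before committing.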

Best Use Cases for RTX 5090 on Spheron

Models up to 30B parameters

If your model fits in 32GB with your preferred quantization, the RTX 5090 gives you Blackwell performance at consumer GPU pricing. Check the VRAM table above. The 78% bandwidth advantage over the RTX 4090 directly improves your throughput-per-dollar for inference on these model sizes.

QLoRA fine-tuning on 7B-30B models

The RTX 5090's 32GB of GDDR7 and Blackwell Tensor Core architecture make it strong for adapter-based fine-tuning. Fine-tune Mistral 7B with QLoRA INT4 at approximately 720 tokens/sec training throughput, a workload that previously required stepping up to A100-class hardware or accepting slower speeds on the RTX 4090. See our LLM fine-tuning guide for setup instructions.
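A rough QLoRA memory budget shows why 32GB is comfortable for a 7B model. This is our own back-of-envelope sketch with hedged assumptions (NF4 base weights at ~0.5 bytes/param, trainable LoRA adapters at ~0.5% of base params in 16-bit with AdamW, a few GB of activations); the adapter fraction and activation figure are illustrative, not measured:

```python
# Rough QLoRA VRAM budget. All constants below are illustrative assumptions.

def qlora_vram_gb(params_b: float, adapter_frac: float = 0.005,
                  activations_gb: float = 6.0) -> float:
    base = params_b * 0.5                        # NF4-quantized frozen weights
    adapter_params_b = params_b * adapter_frac   # trainable LoRA params
    # 16-bit weights + 16-bit grads + ~8 bytes/param of AdamW optimizer state
    adapter = adapter_params_b * (2 + 2 + 8)
    return base + adapter + activations_gb

print(f"~{qlora_vram_gb(7):.1f}GB")  # ~9.9GB, well inside 32GB
```

The frozen base dominates; because only the adapters carry gradients and optimizer state, even a 30B base at NF4 (~15GB) leaves room on a 32GB card.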

ComfyUI, Stable Diffusion, Flux

GDDR7 bandwidth matters for diffusion models. The RTX 5090 achieves 38 img/min on SDXL (20 steps, 1024x1024) and approximately 23 img/min on Flux.1 Dev FP8 (20 steps, 1024x1024). For teams running batch image generation pipelines, the RTX 5090 is one of the best-value options in the cloud. See our ComfyUI on GPU cloud guide for a full benchmark comparison, Docker setup instructions, and cost-per-image analysis.
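For batch pipelines, cost per image is the number to track; a quick sketch using the benchmark figures above:

```python
# Cost per generated image from hourly price and throughput.

def cost_per_image(price_per_hour: float, images_per_min: float) -> float:
    return price_per_hour / (images_per_min * 60)

# RTX 5090 at $0.76/hr:
print(f"${cost_per_image(0.76, 38):.5f}")  # SDXL, 38 img/min -> ~$0.00033/image
print(f"${cost_per_image(0.76, 23):.5f}")  # Flux.1 Dev FP8, ~23 img/min -> ~$0.00055/image
```

At these rates, a million SDXL images costs on the order of $330 in GPU time, which is the kind of figure worth recomputing with your own steps, resolution, and sampler.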

Development and testing before scaling

Start development and evaluation on RTX 5090 at $0.76/hr. When production requirements exceed 32GB (serving a 70B model in FP16, for instance), Spheron lets you switch to H100 or H200 without changing platforms or re-engineering your stack.

High-volume small model APIs

If you're serving millions of Llama 3.1 8B or Mistral 7B queries per day, cost-per-token dominates total spend. At $0.060/M tokens, the RTX 5090 on Spheron delivers the lowest cost-per-token among commonly available GPU options for sub-30B inference. See our GPU cost optimization playbook for production cost analysis.
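At this scale the arithmetic is worth doing explicitly. A sketch with hypothetical volume numbers (5M requests/day and 500 tokens/request are illustrative, not benchmarks):

```python
# Daily inference spend from request volume and cost per million tokens.
# Volume figures below are hypothetical examples.

def daily_cost(requests_per_day: float, avg_tokens_per_request: float,
               cost_per_m_tokens: float) -> float:
    total_m_tokens = requests_per_day * avg_tokens_per_request / 1_000_000
    return total_m_tokens * cost_per_m_tokens

# 5M requests/day, ~500 generated tokens each, at $0.060/M tokens:
print(f"${daily_cost(5_000_000, 500, 0.060):.0f}/day")  # $150/day
```

The same volume at the H100 PCIe's ~$0.112/M tokens would run about $280/day, which is the gap that compounds into real money over a month.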

When NOT to Use RTX 5090

Be honest about the limitations:

70B models in FP16: Llama 3.3 70B in FP16 requires ~140GB VRAM, more than 4x what the RTX 5090 offers. Even Q4 quantization pushes ~35-40GB, exceeding the 32GB limit. For 70B models, the H100 (80GB) or H200 (141GB) are the correct choices.

Multi-GPU tensor parallelism: No NVLink means all GPU-to-GPU communication runs over PCIe. For models split across multiple GPUs using tensor parallelism, this becomes a bottleneck. PCIe is adequate for data parallelism (running independent replicas) but not tight model parallelism.
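To see why the interconnect matters, a crude per-token estimate of tensor-parallel communication cost. The link speeds are hypothetical round numbers (PCIe 5.0 x16 at ~63 GB/s, H100 SXM NVLink at ~450 GB/s per direction), and real all-reduce cost also depends on topology, message size, and overlap with compute:

```python
# Crude per-token all-reduce latency estimate for tensor parallelism.
# Link speeds and payload sizing below are illustrative assumptions.

def allreduce_us(bytes_per_token: float, link_gb_s: float) -> float:
    return bytes_per_token / (link_gb_s * 1e9) * 1e6  # microseconds

# Two-way tensor parallelism on a 7B-class model (32 layers, hidden 4096, FP16):
# roughly 2 all-reduces per layer x 4096 activations x 2 bytes ~ 0.5 MB/token.
payload = 2 * 32 * 4096 * 2
print(f"PCIe:   {allreduce_us(payload, 63):.1f} us/token")   # ~8.3 us
print(f"NVLink: {allreduce_us(payload, 450):.1f} us/token")  # ~1.2 us
```

Several microseconds per token per layer boundary stacks up quickly against decode steps that are themselves only a few hundred microseconds, which is why data parallelism (independent replicas, no per-token traffic) is the sane multi-GPU pattern on this card.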

Enterprise production SLAs: The RTX 5090 is a consumer GPU. It lacks ECC memory, MIG multi-instance GPU partitioning, and the reliability certification that data center GPUs carry. For production deployments with formal SLA requirements, H100 is the proven choice.

Very large batch inference: At large batch sizes, the H100's HBM advantage in raw bandwidth becomes significant. The RTX 5090's GDDR7 is fast, but HBM is a different class of memory with higher peak bandwidth for sustained large-batch workloads.

How to Deploy on Spheron: Quick Start

  1. Go to app.spheron.ai and create an account with GitHub or Gmail
  2. Add credits via the credit button in the top-right corner (card or crypto)
  3. Click Deploy in the left menu to open the GPU catalog
  4. Select RTX 5090, configure vCPUs, RAM, storage, and region
  5. Add your SSH key and click Deploy Instance; your GPU VM is ready in under a minute

SSH into your instance:

```bash
ssh -i <private-key-path> user@<your-vm-ip>
```

Launch a vLLM inference server with Llama 3.1 8B:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

For Flux.1 Dev in ComfyUI, use the standard ComfyUI Docker image; the RTX 5090's 32GB VRAM handles Flux.1 Dev (~24GB model weights at BF16) with room to spare for workflow overhead.

For detailed setup guides, see the Spheron documentation.

Pricing

Current RTX 5090 pricing on Spheron:

| Configuration | Price/hr |
|---|---|
| RTX 5090 (PCIe) | $0.76/hr |
| RTX 5090 bare metal (8x GPU) | $7.39/hr |

Note: Prices above are sourced from Spheron's live marketplace as of March 10, 2026. GPU cloud pricing fluctuates with supply and demand; rates may be lower or higher at the time you read this. Spheron aggregates supply from multiple providers, so pricing and availability vary by region. Check live pricing for current rates.

Comparison with other providers and GPU options:

| Option | Price/hr |
|---|---|
| RTX 5090 on Spheron | $0.76/hr |
| RTX 4090 on Spheron | $0.58/hr |
| H100 PCIe on Spheron | $2.01/hr |
| RTX 5090 on RunPod | ~$1.24/hr |
| RTX 5090 on Lambda Labs | ~$1.59/hr |

Note: Competitor pricing is approximate and changes frequently. Verify current rates directly at each provider's pricing page before making purchasing decisions.

For Llama 3.1 8B inference, the RTX 5090 on Spheron delivers the lowest cost-per-token of any option listed. It even beats the RTX 4090 despite the 4090's lower hourly rate, because the RTX 5090's 46% higher throughput more than offsets its 31% higher price.


RTX 5090 instances on Spheron are available now, no quota, no waitlist. Deploy in minutes.

Rent an RTX 5090 →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.