At $0.76/hr on Spheron, the RTX 5090 delivers approximately 3,500 tokens/sec on Llama 3.1 8B in FP16, roughly $0.060 per million tokens. For teams running inference on models up to 30B parameters, that cost-per-token figure beats the RTX 4090, A100 80GB, and H100 PCIe on Spheron. The Blackwell bandwidth advantage is real, and the pricing makes it accessible.
The RTX 5090 is NVIDIA's flagship consumer Blackwell GPU, launched in January 2025. Its 32GB of GDDR7 memory and 5th-generation Tensor Cores with native FP4 support make it a serious option for inference, fine-tuning, and diffusion workloads at consumer-card pricing rather than data center accelerator pricing. The 1,792 GB/s memory bandwidth is 78% higher than the RTX 4090's and approaches the H100 PCIe's 2,000 GB/s, which is the number that matters most for LLM inference throughput.
Here's exactly what you can run on it, how fast, and what it costs on Spheron.
RTX 5090 Specs: What Matters for AI Workloads
Don't just read the spec sheet; here's what each number actually means for your workloads.
| Spec | RTX 5090 | RTX 4090 | H100 PCIe |
|---|---|---|---|
| Architecture | Blackwell | Ada Lovelace | Hopper |
| VRAM | 32GB GDDR7 | 24GB GDDR6X | 80GB HBM2e |
| Memory Bandwidth | 1,792 GB/s | 1,008 GB/s | 2,000 GB/s |
| CUDA Cores | 21,760 | 16,384 | 14,592 |
| Tensor Cores | 680 (5th Gen) | 512 (4th Gen) | 456 (4th Gen) |
| FP16 Tensor (TFLOPS) | 838.4 (sparse) | 330.3 (sparse) | 1,513 (sparse) |
| AI TOPS | 3,352 (FP4) | 1,321 (INT8) | 3,026 (INT8) |
| FP8 support | Yes | Yes | Yes |
| FP4 support | Yes (Blackwell native) | No | No |
| TDP | 575W | 450W | 350W |
| Form factor | PCIe | PCIe | PCIe / SXM |
| NVLink | No | No | No (PCIe) / Yes (SXM) |
Note: AI TOPS figures reflect NVIDIA's official published values at each GPU's highest supported precision. The RTX 5090's 3,352 TOPS uses FP4 with sparsity (Blackwell native, not available on Ada Lovelace or Hopper). At INT8 with sparsity, the RTX 5090 delivers approximately 1,677 TOPS. RTX 4090 and H100 PCIe figures are INT8 with sparsity.
What the specs mean in practice:
32GB GDDR7 fits most models up to 30B parameters comfortably. The critical upgrade from the RTX 4090 is not just capacity (+8GB) but bandwidth: GDDR7 delivers 1,792 GB/s, 78% more than the 4090's GDDR6X. Since LLM autoregressive decoding is almost entirely memory-bandwidth-bound, this directly improves inference throughput.
Blackwell FP4 is native to the Blackwell GPU generation, including the RTX 5090, and is not available on any prior consumer GPU architecture. For models that support FP4 quantization, throughput roughly doubles compared to FP16. This matters for inference deployments that can tolerate the quantization trade-off, and it's a feature no Ada Lovelace or earlier consumer GPU supports.
No NVLink is the key limitation. Unlike H100 SXM, the RTX 5090 uses PCIe for multi-GPU communication. Running parallel independent model replicas works fine. Tensor parallelism across GPUs (splitting a single large model's layers) does not. Plan accordingly.
96MB L2 cache (up from 72MB on the RTX 4090) reduces memory bandwidth pressure for workloads with repetitive access patterns, including transformer attention. This partially explains why throughput gains outpace raw bandwidth ratios on some workloads.
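Because decode is bandwidth-bound, you can sketch a per-stream throughput ceiling directly from the spec table: each generated token requires streaming roughly all model weights from VRAM once, so tokens/sec per stream cannot exceed bandwidth divided by model size. A minimal sketch of that rule of thumb (the ceiling is an upper bound; real numbers land below it, and batching multiplies aggregate throughput by reusing each weight read across requests):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound LLM.

    Each generated token reads (roughly) every weight from VRAM once,
    so tokens/sec per stream <= bandwidth / model size in bytes.
    """
    return bandwidth_gb_s / model_gb

# Llama 3.1 8B in FP16 is ~16 GB of weights.
print(decode_ceiling_tok_s(1792, 16))  # RTX 5090: 112.0 tok/s per stream
print(decode_ceiling_tok_s(1008, 16))  # RTX 4090: 63.0 tok/s per stream
```

The ratio between the two ceilings is exactly the 78% bandwidth gap, which is why the bandwidth row in the table is the one to read first.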
What Models Fit in 32GB VRAM
| Model | Size | Precision | Fits on RTX 5090? | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | FP16 | ✅ Yes | ~16GB, plenty of headroom |
| Mistral 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Qwen 2.5 14B | 14B | FP16 | ⚠️ Marginal | ~28GB weights; full footprint with KV cache ~34GB, requires --max-model-len limits |
| Qwen 2.5 32B | 32B | Q4 | ✅ Yes | ~20GB with Q4 quantization |
| Llama 4 Scout | 109B MoE | INT4 | ❌ No | ~55GB at INT4, exceeds 32GB |
| DeepSeek R1 7B | 7B | FP16 | ✅ Yes | ~14GB |
| Llama 3.3 70B | 70B | FP16 | ❌ No | Needs ~140GB |
| Llama 3.3 70B | 70B | Q4 | ❌ No | ~35-40GB, exceeds 32GB limit |
| Flux.1 Dev | - | BF16 | ⚠️ Marginal | ~30-33GB with T5 XXL + VAE; use FP8 for reliable operation |
| SDXL | - | FP16 | ✅ Yes | ~8-10GB, comfortable |
For a complete model-to-VRAM reference across the full current model landscape, see our GPU requirements cheat sheet.
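The fit column in the table above follows from simple arithmetic: parameter count times bytes per parameter, plus some framework overhead. A rough estimator, assuming ~10% overhead for CUDA context and framework buffers and ~0.55 bytes/param for Q4 (4-bit weights plus per-group scales); KV cache is extra and grows with context length and batch size:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "q4": 0.55}

def weights_gb(params_b: float, precision: str, overhead: float = 1.1) -> float:
    """Rough VRAM footprint of model weights alone, in GB.

    params_b is parameter count in billions; overhead=1.1 adds ~10%
    for CUDA context and framework buffers. KV cache is not included.
    """
    return params_b * BYTES_PER_PARAM[precision] * overhead

print(round(weights_gb(8, "fp16"), 1))   # Llama 3.1 8B -> ~17.6 GB, fits
print(round(weights_gb(32, "q4"), 1))    # Qwen 2.5 32B Q4 -> ~19.4 GB, fits
print(round(weights_gb(70, "fp16"), 1))  # Llama 3.3 70B -> ~154 GB, no fit
```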
Real Throughput Benchmarks
Throughput numbers below are based on vLLM benchmarks on the RTX 5090 and extrapolated from memory bandwidth ratios for model variants not directly measured. All figures represent aggregate throughput across concurrent requests at the listed batch size (not per-request latency). Actual throughput varies with prompt length, context length, driver version, and vLLM version.
| Model | Framework | Batch Size | Tokens/sec | TTFT (ms) | VRAM Used | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | vLLM | 32 | ~3,500 | ~45 | ~16GB | |
| Mistral 7B | vLLM | 32 | ~4,100 | ~40 | ~14GB | |
| Qwen 2.5 14B | vLLM | 16 | ~1,900 | ~80 | ~28GB | Requires --max-model-len (e.g. --max-model-len 2048) to cap KV cache; full footprint with KV cache ~34GB exceeds 32GB and causes OOM without this flag |
| Qwen 2.5 32B (Q4) | vLLM | 8 | ~1,600 | ~65 | ~20GB | |
| Flux.1 Dev (FP8) | ComfyUI | 1 | ~23 img/min | - | ~18GB | 20 diffusion sampling steps, FP8 quantization |
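The Qwen 2.5 14B OOM note in the table comes down to KV cache arithmetic: every layer stores an FP16 K and V tensor per token of context, so total context tokens (sequence length times batch size) determine cache size. A sketch, using layer and head counts assumed from the published Qwen2.5-14B configuration (48 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    """FP16 KV cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch / 1e9

# ~192 KB per token of context with this config, so it adds up fast:
print(kv_cache_gb(48, 8, 128, seq_len=32768, batch=1))  # ~6.4 GB for one 32k request
print(kv_cache_gb(48, 8, 128, seq_len=2048, batch=16))  # same ~6.4 GB for 16 x 2k
```

With ~28GB of FP16 weights already resident, even a few GB of KV cache blows past 32GB, which is why --max-model-len is mandatory for that row.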
The most striking data point: for Llama 3.1 8B, the RTX 5090 delivers approximately 46% more throughput than the RTX 4090 (which benchmarks at ~2,400 tokens/sec for 7B-class models) at only 31% higher hourly cost on Spheron. The Blackwell memory bandwidth improvement is real and directly translates to inference throughput for memory-bound workloads.
For Stable Diffusion XL, the RTX 5090 achieves 38 img/min at 1024x1024 in FP16 with 20 steps (Euler sampler), compared to 28 img/min on the RTX 4090 under identical settings, a 36% improvement driven by Blackwell's higher memory bandwidth and compute throughput. The gain is smaller than the raw bandwidth ratio (78%) because SDXL performance is partially compute-bound at these settings.
Cost Per Million Tokens: The Number That Actually Matters
This is what developers should bookmark. The formula:
Cost per 1M tokens = ($/hr) / (tokens/sec × 3600) × 1,000,000

| GPU | $/hr on Spheron | Tokens/sec (Llama 3.1 8B) | Cost per 1M tokens |
|---|---|---|---|
| RTX 5090 | $0.76 | ~3,500 | ~$0.060 |
| RTX 4090 | $0.58 | ~2,400 | ~$0.067 |
| A100 (80GB) | $1.07 | ~2,800 | ~$0.106 |
| H100 PCIe | $2.01 | ~5,000 | ~$0.112 |
Worked example with RTX 5090 at $0.76/hr and 3,500 tokens/sec:
Cost per 1M tokens = $0.76 / (3,500 × 3,600) × 1,000,000
                   = $0.76 / 12,600,000 × 1,000,000
                   ≈ $0.060 per million tokens

The H100 PCIe generates faster raw throughput (~5,000 tok/s) but costs $2.01/hr, nearly 2.6x the RTX 5090's price. For small models that fit in 32GB, the H100's larger memory (80GB) provides no advantage. The result: the RTX 5090 delivers better cost-per-token economics for any model under 30B parameters.
To calculate your own cost: find your model's tokens/sec on the RTX 5090, plug in $0.76/hr, and use the formula above. The same math scales to any Spheron pricing tier.
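The formula in code form, reproducing the table rows above so you can plug in your own throughput numbers:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Cost per 1M generated tokens at a given hourly rate and throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1_000_000

print(round(cost_per_million_tokens(0.76, 3500), 3))  # RTX 5090  -> 0.06
print(round(cost_per_million_tokens(0.58, 2400), 3))  # RTX 4090  -> 0.067
print(round(cost_per_million_tokens(2.01, 5000), 3))  # H100 PCIe -> 0.112
```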
Best Use Cases for RTX 5090 on Spheron
Models up to 30B parameters
If your model fits in 32GB with your preferred quantization, the RTX 5090 gives you Blackwell performance at consumer GPU pricing. Check the VRAM table above. The 78% bandwidth advantage over the RTX 4090 directly improves your throughput-per-dollar for inference on these model sizes.
QLoRA fine-tuning on 7B-30B models
The RTX 5090's 32GB of GDDR7 and Blackwell Tensor Core architecture make it strong for adapter-based fine-tuning. Fine-tune Mistral 7B with QLoRA INT4 at approximately 720 tokens/sec training throughput, a workload that previously required stepping up to A100-class hardware or accepting slower speeds on the RTX 4090. See our LLM fine-tuning guide for setup instructions.
ComfyUI, Stable Diffusion, Flux
GDDR7 bandwidth matters for diffusion models. The RTX 5090 achieves 38 img/min on SDXL (20 steps, 1024x1024) and approximately 23 img/min on Flux.1 Dev FP8 (20 steps, 1024x1024). For teams running batch image generation pipelines, the RTX 5090 is one of the best-value options in the cloud. See our ComfyUI on GPU cloud guide for a full benchmark comparison, Docker setup instructions, and cost-per-image analysis.
Development and testing before scaling
Start development and evaluation on RTX 5090 at $0.76/hr. When production requirements exceed 32GB (serving a 70B model in FP16, for instance), Spheron lets you switch to H100 or H200 without changing platforms or re-engineering your stack.
High-volume small model APIs
If you're serving millions of Llama 3.1 8B or Mistral 7B queries per day, cost-per-token dominates total spend. At $0.060/M tokens, the RTX 5090 on Spheron delivers the lowest cost-per-token among commonly available GPU options for sub-30B inference. See our GPU cost optimization playbook for production cost analysis.
When NOT to Use RTX 5090
Be honest about the limitations:
70B models in FP16: Llama 3.3 70B in FP16 requires ~140GB VRAM, more than 4x what the RTX 5090 offers. Even Q4 quantization pushes ~35-40GB, exceeding the 32GB limit. For 70B models, the H100 (80GB) or H200 (141GB) are the correct choices.
Multi-GPU tensor parallelism: No NVLink means all GPU-to-GPU communication runs over PCIe. For models split across multiple GPUs using tensor parallelism, this becomes a bottleneck. PCIe is adequate for data parallelism (running independent replicas) but not tight model parallelism.
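Data parallelism without NVLink is straightforward in practice: each GPU runs its own full replica, and a client-side dispatcher spreads requests across them with no inter-GPU traffic at all. A minimal round-robin sketch (the endpoint URLs are placeholders for wherever your replicas listen):

```python
import itertools

class ReplicaPool:
    """Round-robin across independent model replicas (data parallelism).

    Each replica holds a full copy of the model on its own GPU, so no
    GPU-to-GPU communication is needed -- the pattern PCIe-only boxes
    handle fine, unlike tensor parallelism.
    """
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self) -> str:
        return next(self._cycle)

# Placeholder replica addresses for illustration:
pool = ReplicaPool([f"http://10.0.0.{i}:8000/v1" for i in (1, 2, 3)])
print([pool.next_endpoint() for _ in range(4)])  # cycles back to the first
```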
Enterprise production SLAs: The RTX 5090 is a consumer GPU. It lacks ECC memory, MIG multi-instance GPU partitioning, and the reliability certification that data center GPUs carry. For production deployments with formal SLA requirements, H100 is the proven choice.
Very large batch inference: At large batch sizes, the H100's HBM advantage in raw bandwidth becomes significant. The RTX 5090's GDDR7 is fast, but HBM is a different class of memory with higher peak bandwidth for sustained large-batch workloads.
How to Deploy on Spheron: Quick Start
- Go to app.spheron.ai and create an account with GitHub or Gmail
- Add credits via the credit button in the top-right corner (card or crypto)
- Click Deploy in the left menu to open the GPU catalog
- Select RTX 5090, configure vCPUs, RAM, storage, and region
- Add your SSH key and click Deploy Instance; your GPU VM is ready in under a minute
SSH into your instance:
```shell
ssh -i <private-key-path> user@<your-vm-ip>
```

Launch a vLLM inference server with Llama 3.1 8B:

```shell
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

For Flux.1 Dev in ComfyUI, use the standard ComfyUI Docker image; the RTX 5090's 32GB VRAM handles Flux.1 Dev (~24GB model weights at BF16) with room to spare for workflow overhead.
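Once the vLLM server is up, any OpenAI-compatible client can talk to it. A minimal sketch that builds a request for the /v1/chat/completions endpoint (the function name and the localhost base URL are illustrative; the model name matches the --model flag above):

```python
import json
import urllib.request

def chat_request(base_url: str, prompt: str,
                 model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Build (url, body) for vLLM's OpenAI-compatible chat endpoint.

    Returning the pair lets you inspect the payload or send it with
    any HTTP client; base_url is your instance, e.g. http://<your-vm-ip>:8000.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return f"{base_url}/v1/chat/completions", body

url, body = chat_request("http://localhost:8000", "Say hello in one word.")
print(url)
# To actually send it (requires the running server from the docker command above):
# req = urllib.request.Request(url, json.dumps(body).encode(),
#                              {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```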
For detailed setup guides, see the Spheron documentation.
Pricing
Current RTX 5090 pricing on Spheron:
| Configuration | Price/hr |
|---|---|
| RTX 5090 (PCIe) | $0.76/hr |
| RTX 5090 bare metal (8x GPU) | $7.39/hr |
Note: Prices above are sourced from Spheron's live marketplace as of March 10, 2026. GPU cloud pricing fluctuates with supply and demand; rates may be lower or higher when you read this. Spheron aggregates supply from multiple providers, so pricing and availability vary by region. Check live pricing for current rates.
Comparison with other providers and GPU options:
| Option | Price/hr |
|---|---|
| RTX 5090 on Spheron | $0.76/hr |
| RTX 4090 on Spheron | $0.58/hr |
| H100 PCIe on Spheron | $2.01/hr |
| RTX 5090 on RunPod | ~$1.24/hr |
| RTX 5090 on Lambda Labs | ~$1.59/hr |
Note: Competitor pricing is approximate and changes frequently. Verify current rates directly at each provider's pricing page before making purchasing decisions.
For Llama 3.1 8B inference, the RTX 5090 on Spheron provides the lowest cost-per-token of any option listed. It beats even the RTX 4090, despite the 4090's lower hourly rate, because the 5090's 46% throughput advantage more than offsets its 31% price premium.
RTX 5090 instances on Spheron are available now, no quota, no waitlist. Deploy in minutes.
