1,792 GB/s of memory bandwidth and 32GB of GDDR7 are the two specs that define what the RTX 5090 can do for AI workloads. That bandwidth figure is 78% higher than the RTX 4090 and puts the 5090 within striking distance of the H100 PCIe (2,000 GB/s) at a fraction of the rental cost. The 32GB gives you room for 13B models at full FP16 precision and 32B models under Q4 quantization on a single card. The full datasheet is below.
RTX 5090 Specs: Full Datasheet
| Specification | Value |
|---|---|
| Architecture | Blackwell |
| GPU Die | GB202 |
| Process Node | TSMC 4NP |
| Transistors | 92.2 billion |
| Streaming Multiprocessors | 170 |
| CUDA Cores | 21,760 |
| 5th-Gen Tensor Cores | 680 |
| 4th-Gen RT Cores | 170 |
| VRAM | 32GB GDDR7 |
| Memory Bus Width | 512-bit |
| Memory Bandwidth | 1,792 GB/s |
| FP32 Performance | ~105.2 TFLOPS |
| FP16 Tensor Performance (sparse) | 838.4 TFLOPS |
| FP8 Tensor Performance (sparse) | ~1,676.8 TFLOPS |
| FP4 Tensor Performance (sparse) | 3,352 TOPS |
| AI TOPS | 3,352 |
| TDP | 575W |
| Power Connector | 16-pin (12V-2x6) |
| PCIe Generation | Gen 5 x16 |
| DLSS | 4 |
| Encoder/Decoder | 2x NVENC / 2x NVDEC |
Blackwell Architecture: GB202 Die
The RTX 5090 runs on NVIDIA's Blackwell architecture, specifically the GB202 die. The key upgrade from Ada Lovelace (RTX 4090) is the jump to 5th-generation Tensor Cores with native FP4 support. Ada Lovelace stopped at FP8; Blackwell adds a full FP4 precision tier that doubles effective throughput again for models that support it.
The GB202 die packs 92.2 billion transistors on TSMC's 4NP node into 170 streaming multiprocessors. Each SM contributes 128 CUDA cores, 4 Tensor Cores (5th gen), and 1 RT Core (4th gen). More SMs means more parallelism for both compute and inference workloads that scale well across threads.
The 4th-gen RT Cores improve ray-triangle intersection throughput over Ada's 3rd-gen, relevant for rendering but not LLM inference. DLSS 4 switches from a CNN to a transformer-based super-resolution model, enabling multi-frame generation. That's a consumer feature; for AI workloads, the Tensor Core upgrade is what matters.
Memory Subsystem: 32GB GDDR7 and 1,792 GB/s Bandwidth
For LLM inference, memory bandwidth is the primary bottleneck. Token generation is memory-bound: every forward pass requires loading model weights from VRAM, and bandwidth determines how fast that transfer happens. CUDA core count matters less than how quickly you can move data.
The RTX 5090 pairs a 512-bit bus with GDDR7 speeds to hit 1,792 GB/s. Here is how that compares to adjacent GPUs:
| GPU | VRAM | Bandwidth |
|---|---|---|
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s |
| H100 PCIe | 80GB HBM2e | 2,000 GB/s |
| H100 SXM5 | 80GB HBM3 | 3,350 GB/s |
The RTX 5090 reaches 89% of H100 PCIe bandwidth and 54% of H100 SXM5 bandwidth. Given the rental cost gap between the 5090 and either H100 variant, that bandwidth ratio translates to very favorable cost-per-token numbers for models under 32B parameters. The 8GB VRAM advantage over the RTX 4090 also matters for 13B-32B inference where 24GB is tight.
5th-Gen Tensor Cores: FP4, FP8, FP16, BF16, INT8 Throughput
The jump from 4th to 5th-gen Tensor Cores brings two changes: native FP4 support and higher per-SM throughput at FP8 and FP16. The precision table below shows where the RTX 5090 lands relative to the RTX 4090 and H100 PCIe:
| Precision | RTX 5090 | RTX 4090 | H100 PCIe |
|---|---|---|---|
| FP4 (sparse) | 3,352 TOPS | Not supported | Not supported |
| FP8 (sparse) | ~1,676 TFLOPS | ~1,321 TFLOPS | 3,026 TFLOPS |
| FP16 Tensor (sparse) | 838.4 TFLOPS | 330.3 TFLOPS | 1,513 TFLOPS |
| BF16 Tensor (sparse) | 838.4 TFLOPS | 330.3 TFLOPS | 1,513 TFLOPS |
| INT8 Tensor (sparse) | 1,676.8 TOPS | 660.6 TOPS | 3,026 TOPS |
Note: all figures above are sparse (2:4 structured sparsity). Dense performance is exactly half these numbers. NVIDIA's marketing figures use sparse throughput; when comparing to third-party benchmarks that report dense numbers, divide the sparse values by 2.
The FP4 tier is forward-looking. Most production models today use FP8 or AWQ INT4 quantization. FP4 throughput will matter as Blackwell-native quantization tools like MXFP4 stabilize in vLLM and TRT-LLM. Check FP4 quantization on Blackwell GPUs for the current state of tooling support.
RT Cores, DLSS 4, and AI Features
The 170 4th-gen RT Cores (one per SM) improve ray-triangle intersection performance over the RTX 4090's 3rd-gen units. This affects rendering and ray-tracing workloads; for LLM inference it is not relevant.
DLSS 4 replaces the CNN-based upscaler in DLSS 3 with a transformer model, enabling multi-frame generation from single rendered frames. Again, a consumer rendering feature.
What does matter for AI practitioners: the RTX 5090 supports the Transformer Engine via its Tensor Cores, enabling automatic FP8/FP16 mixed-precision in supported frameworks (PyTorch, JAX, TRT-LLM). This reduces manual precision management in training and fine-tuning pipelines.
One caveat: RTX 5090 is a consumer card without ECC memory support. For production inference requiring error-correcting memory (finance, healthcare, safety-critical workloads), use H100 or B200 class hardware instead.
Power, Cooling, and PCIe 5.0 Connectivity
The RTX 5090 has a 575W TDP, up from 450W on the RTX 4090. That is a 28% power increase alongside a 78% bandwidth gain, which is a reasonable trade for performance-per-watt at this tier.
Power connector: 16-pin 12V-2x6. PSU sizing: 850W minimum for the full system, 1000W recommended to maintain headroom under sustained load. Running near PSU capacity degrades power delivery quality and increases failure risk.
PCIe 5.0 x16 doubles host-to-GPU transfer bandwidth versus PCIe 4.0: 64 GB/s bidirectional versus 32 GB/s. This matters when streaming large batches from CPU memory or doing high-frequency model swapping. The card works in a PCIe 4.0 slot at reduced transfer bandwidth, which typically has minimal impact on inference latency but can affect data pipeline throughput.
Cooling form factor varies by AIB partner (dual-slot or triple-slot open-air, blower for workstation configs). On cloud instances via Spheron, cooling is abstracted away entirely.
RTX 5090 vs RTX 4090: Spec Delta
| Spec | RTX 5090 | RTX 4090 | Delta |
|---|---|---|---|
| Architecture | Blackwell (GB202) | Ada Lovelace (AD102) | New die |
| CUDA Cores | 21,760 | 16,384 | +33% |
| VRAM | 32GB GDDR7 | 24GB GDDR6X | +8GB (+33%) |
| Bandwidth | 1,792 GB/s | 1,008 GB/s | +78% |
| FP16 Tensor (sparse) | 838.4 TFLOPS | 330.3 TFLOPS | +154% |
| FP4 support | Yes | No | New |
| TDP | 575W | 450W | +28% |
| Spheron on-demand | $0.86/hr | $0.53/hr | (live values) |
The RTX 5090 is the highest-bandwidth consumer GPU ever shipped. The 78% bandwidth jump over the 4090 is the headline number for inference workloads; the FP16 Tensor throughput gain (+154%) compounds on top for fine-tuning and training.
For a full workload benchmark with cost-per-token analysis, see RTX 5090 vs RTX 4090 for AI.
RTX 5090 vs RTX PRO 6000 Blackwell
The RTX PRO 6000 Blackwell uses the same GB202 die as the RTX 5090 but with very different memory: 96GB of GDDR7, a dedicated NVLINK connector for workstation NVLink configs, and ECC memory support. It drops DLSS and consumer gaming features in favor of reliability and headroom.
Pick the RTX PRO 6000 when: you need 48-96GB of VRAM for 30B-70B models at FP16 precision, your environment requires ECC memory, or you need data-center EULA compliance. Pick the RTX 5090 when: 32GB is sufficient for your models, you want the best cost-per-token for sub-32B inference on Spheron, or you want to run both options on the same platform without switching providers.
Both are available on Spheron: rent RTX 5090 or rent RTX PRO 6000 Blackwell.
RTX 5090 for AI Inference: VRAM Headroom by Model Size
The 32GB GDDR7 fits a wide range of production model sizes. Here is a breakdown at common precisions:
| Model | Precision | VRAM Required | Fits RTX 5090 (32GB)? |
|---|---|---|---|
| Llama 3.1 8B | FP16 | ~16GB | Yes |
| Llama 3.1 8B | FP8 | ~8GB | Yes |
| Llama 3.3 13B | FP16 | ~26GB | Yes |
| Qwen3 14B | FP16 | ~28GB | Tight - limit context |
| Qwen3 32B | Q4/AWQ | ~20GB | Yes |
| Llama 3.3 70B | INT4 | ~35-40GB | No |
| FLUX.1 Dev | BF16 | ~26GB | Yes |
| SDXL | FP16 | ~8-12GB | Yes |
For inference throughput, bandwidth drives token generation rate more than CUDA core count. The memory-bound decode phase loads model weights repeatedly for each token; 1,792 GB/s means you complete each load cycle faster. CUDA cores handle the compute phase, which is typically not the bottleneck on sub-13B models.
See GPU memory requirements for LLMs for the full model capacity table across all GPU tiers.
The RTX 5090 has no NVLink connector. Multi-GPU tensor parallelism (used to serve 70B+ models across multiple GPUs) is not practical on RTX 5090 instances. For 70B+ multi-GPU serving, use H100 SXM or B200.
Where to Rent RTX 5090: Spheron Pricing
Spheron is an on-demand and spot GPU marketplace aggregating compute from 5+ providers into a single platform with per-minute billing. RTX 5090 instances are available now, no quota requests needed, and typically deploy in under 2 minutes.
| Configuration | Type | Price |
|---|---|---|
| RTX 5090 (1x GPU) | On-demand | $0.86/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 20 May 2026 and may have changed. Check current GPU pricing for live rates.
You can rent RTX 5090 GPU instances on Spheron with per-minute billing and no minimum commitment.
Quick start with vLLM serving Llama 3.1 8B:
# Serve Llama 3.1 8B on RTX 5090 with vLLM
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90See the Spheron getting-started guide for SSH setup, instance types, and storage configuration.
The RTX 5090's 1,792 GB/s bandwidth and 32GB GDDR7 make it the highest-performance consumer GPU available for LLM inference today. Deploy a bare-metal instance on Spheron in under 2 minutes.
Frequently Asked Questions
The RTX 5090 has 21,760 CUDA cores across 170 streaming multiprocessors (SMs) on the GB202 die. This is 33% more CUDA cores than the RTX 4090's 16,384. Each SM includes 4th-gen RT Cores and 5th-gen Tensor Cores with native FP4 support.
The RTX 5090 has 32GB of GDDR7 memory on a 512-bit bus, delivering 1,792 GB/s of bandwidth. This is 8GB more than the RTX 4090's 24GB GDDR6X and fits 8B models at FP16 (~16GB), 13B models at FP16 (~26GB), and 32B models at Q4/AWQ (~20GB).
It depends on workload scale. The RTX 5090 is better for cost-sensitive inference on models up to 32B parameters - it delivers approximately $0.060-0.068 per million tokens for Llama 3.1 8B FP16, cheaper than the H100 PCIe. The H100 wins for 70B+ models (80GB HBM2e vs 32GB GDDR7), ECC memory requirements, multi-GPU NVLink tensor parallelism, and enterprise SLA workloads. For development and sub-30B inference, the RTX 5090 is the better value.
Yes. The RTX 5090 uses the Blackwell GB202 die and has native FP4 support via 5th-generation Tensor Cores. At FP4 sparse, it delivers 3,352 AI TOPS. FP4 quantization support in vLLM is in active development; check vLLM release notes for stable FP4 availability before using it in production.
The RTX 5090 has a 575W TDP. It uses a 16-pin (12V-2x6) power connector. You need a PSU with at least 850W total capacity (1000W recommended for a full system under load). It also requires PCIe 5.0 x16 for full bandwidth, though it will work in a PCIe 4.0 slot at reduced transfer bandwidth.
Comfortably: Llama 3.1 8B at FP16 (~16GB), Mistral 7B at FP16 (~14GB), Llama 3.3 13B at FP16 (~26GB). With quantization: Qwen3 32B at Q4/AWQ (~20GB fits with room for KV cache). Does not fit: Llama 3.3 70B at any quantization (~35-40GB at INT4), Llama 4 Scout 109B MoE (~55GB at INT4).
