Research

NVIDIA RTX 5090 Specs: 32GB GDDR7, Blackwell Architecture, and 5th Gen Tensor Cores (2026)

Back to BlogWritten by Mitrasish, Co-founderMay 20, 2026
rtx 5090 specsnvidia rtx 5090 specsrtx 5090 specifications5090 specsrtx 5090 gddr7rtx 5090 tensor coresrtx 5090 datasheetrtx 5090 cuda coresRTX 5090Blackwell ArchitectureGDDR7FP4 Tensor CoresAI InferenceVRAM CapacityConsumer GPU AIGB202
NVIDIA RTX 5090 Specs: 32GB GDDR7, Blackwell Architecture, and 5th Gen Tensor Cores (2026)

1,792 GB/s of memory bandwidth and 32GB of GDDR7 are the two specs that define what the RTX 5090 can do for AI workloads. That bandwidth figure is 78% higher than the RTX 4090 and puts the 5090 within striking distance of the H100 PCIe (2,000 GB/s) at a fraction of the rental cost. The 32GB gives you room for 13B models at full FP16 precision and 32B models under Q4 quantization on a single card. The full datasheet is below.

RTX 5090 Specs: Full Datasheet

SpecificationValue
ArchitectureBlackwell
GPU DieGB202
Process NodeTSMC 4NP
Transistors92.2 billion
Streaming Multiprocessors170
CUDA Cores21,760
5th-Gen Tensor Cores680
4th-Gen RT Cores170
VRAM32GB GDDR7
Memory Bus Width512-bit
Memory Bandwidth1,792 GB/s
FP32 Performance~105.2 TFLOPS
FP16 Tensor Performance (sparse)838.4 TFLOPS
FP8 Tensor Performance (sparse)~1,676.8 TFLOPS
FP4 Tensor Performance (sparse)3,352 TOPS
AI TOPS3,352
TDP575W
Power Connector16-pin (12V-2x6)
PCIe GenerationGen 5 x16
DLSS4
Encoder/Decoder2x NVENC / 2x NVDEC

Blackwell Architecture: GB202 Die

The RTX 5090 runs on NVIDIA's Blackwell architecture, specifically the GB202 die. The key upgrade from Ada Lovelace (RTX 4090) is the jump to 5th-generation Tensor Cores with native FP4 support. Ada Lovelace stopped at FP8; Blackwell adds a full FP4 precision tier that doubles effective throughput again for models that support it.

The GB202 die packs 92.2 billion transistors on TSMC's 4NP node into 170 streaming multiprocessors. Each SM contributes 128 CUDA cores, 4 Tensor Cores (5th gen), and 1 RT Core (4th gen). More SMs means more parallelism for both compute and inference workloads that scale well across threads.

The 4th-gen RT Cores improve ray-triangle intersection throughput over Ada's 3rd-gen, relevant for rendering but not LLM inference. DLSS 4 switches from a CNN to a transformer-based super-resolution model, enabling multi-frame generation. That's a consumer feature; for AI workloads, the Tensor Core upgrade is what matters.

Memory Subsystem: 32GB GDDR7 and 1,792 GB/s Bandwidth

For LLM inference, memory bandwidth is the primary bottleneck. Token generation is memory-bound: every forward pass requires loading model weights from VRAM, and bandwidth determines how fast that transfer happens. CUDA core count matters less than how quickly you can move data.

The RTX 5090 pairs a 512-bit bus with GDDR7 speeds to hit 1,792 GB/s. Here is how that compares to adjacent GPUs:

GPUVRAMBandwidth
RTX 509032GB GDDR71,792 GB/s
RTX 409024GB GDDR6X1,008 GB/s
H100 PCIe80GB HBM2e2,000 GB/s
H100 SXM580GB HBM33,350 GB/s

The RTX 5090 reaches 89% of H100 PCIe bandwidth and 54% of H100 SXM5 bandwidth. Given the rental cost gap between the 5090 and either H100 variant, that bandwidth ratio translates to very favorable cost-per-token numbers for models under 32B parameters. The 8GB VRAM advantage over the RTX 4090 also matters for 13B-32B inference where 24GB is tight.

5th-Gen Tensor Cores: FP4, FP8, FP16, BF16, INT8 Throughput

The jump from 4th to 5th-gen Tensor Cores brings two changes: native FP4 support and higher per-SM throughput at FP8 and FP16. The precision table below shows where the RTX 5090 lands relative to the RTX 4090 and H100 PCIe:

PrecisionRTX 5090RTX 4090H100 PCIe
FP4 (sparse)3,352 TOPSNot supportedNot supported
FP8 (sparse)~1,676 TFLOPS~1,321 TFLOPS3,026 TFLOPS
FP16 Tensor (sparse)838.4 TFLOPS330.3 TFLOPS1,513 TFLOPS
BF16 Tensor (sparse)838.4 TFLOPS330.3 TFLOPS1,513 TFLOPS
INT8 Tensor (sparse)1,676.8 TOPS660.6 TOPS3,026 TOPS

Note: all figures above are sparse (2:4 structured sparsity). Dense performance is exactly half these numbers. NVIDIA's marketing figures use sparse throughput; when comparing to third-party benchmarks that report dense numbers, divide the sparse values by 2.

The FP4 tier is forward-looking. Most production models today use FP8 or AWQ INT4 quantization. FP4 throughput will matter as Blackwell-native quantization tools like MXFP4 stabilize in vLLM and TRT-LLM. Check FP4 quantization on Blackwell GPUs for the current state of tooling support.

RT Cores, DLSS 4, and AI Features

The 170 4th-gen RT Cores (one per SM) improve ray-triangle intersection performance over the RTX 4090's 3rd-gen units. This affects rendering and ray-tracing workloads; for LLM inference it is not relevant.

DLSS 4 replaces the CNN-based upscaler in DLSS 3 with a transformer model, enabling multi-frame generation from single rendered frames. Again, a consumer rendering feature.

What does matter for AI practitioners: the RTX 5090 supports the Transformer Engine via its Tensor Cores, enabling automatic FP8/FP16 mixed-precision in supported frameworks (PyTorch, JAX, TRT-LLM). This reduces manual precision management in training and fine-tuning pipelines.

One caveat: RTX 5090 is a consumer card without ECC memory support. For production inference requiring error-correcting memory (finance, healthcare, safety-critical workloads), use H100 or B200 class hardware instead.

Power, Cooling, and PCIe 5.0 Connectivity

The RTX 5090 has a 575W TDP, up from 450W on the RTX 4090. That is a 28% power increase alongside a 78% bandwidth gain, which is a reasonable trade for performance-per-watt at this tier.

Power connector: 16-pin 12V-2x6. PSU sizing: 850W minimum for the full system, 1000W recommended to maintain headroom under sustained load. Running near PSU capacity degrades power delivery quality and increases failure risk.

PCIe 5.0 x16 doubles host-to-GPU transfer bandwidth versus PCIe 4.0: 64 GB/s bidirectional versus 32 GB/s. This matters when streaming large batches from CPU memory or doing high-frequency model swapping. The card works in a PCIe 4.0 slot at reduced transfer bandwidth, which typically has minimal impact on inference latency but can affect data pipeline throughput.

Cooling form factor varies by AIB partner (dual-slot or triple-slot open-air, blower for workstation configs). On cloud instances via Spheron, cooling is abstracted away entirely.

RTX 5090 vs RTX 4090: Spec Delta

SpecRTX 5090RTX 4090Delta
ArchitectureBlackwell (GB202)Ada Lovelace (AD102)New die
CUDA Cores21,76016,384+33%
VRAM32GB GDDR724GB GDDR6X+8GB (+33%)
Bandwidth1,792 GB/s1,008 GB/s+78%
FP16 Tensor (sparse)838.4 TFLOPS330.3 TFLOPS+154%
FP4 supportYesNoNew
TDP575W450W+28%
Spheron on-demand$0.86/hr$0.53/hr(live values)

The RTX 5090 is the highest-bandwidth consumer GPU ever shipped. The 78% bandwidth jump over the 4090 is the headline number for inference workloads; the FP16 Tensor throughput gain (+154%) compounds on top for fine-tuning and training.

For a full workload benchmark with cost-per-token analysis, see RTX 5090 vs RTX 4090 for AI.

RTX 5090 vs RTX PRO 6000 Blackwell

The RTX PRO 6000 Blackwell uses the same GB202 die as the RTX 5090 but with very different memory: 96GB of GDDR7, a dedicated NVLINK connector for workstation NVLink configs, and ECC memory support. It drops DLSS and consumer gaming features in favor of reliability and headroom.

Pick the RTX PRO 6000 when: you need 48-96GB of VRAM for 30B-70B models at FP16 precision, your environment requires ECC memory, or you need data-center EULA compliance. Pick the RTX 5090 when: 32GB is sufficient for your models, you want the best cost-per-token for sub-32B inference on Spheron, or you want to run both options on the same platform without switching providers.

Both are available on Spheron: rent RTX 5090 or rent RTX PRO 6000 Blackwell.

RTX 5090 for AI Inference: VRAM Headroom by Model Size

The 32GB GDDR7 fits a wide range of production model sizes. Here is a breakdown at common precisions:

ModelPrecisionVRAM RequiredFits RTX 5090 (32GB)?
Llama 3.1 8BFP16~16GBYes
Llama 3.1 8BFP8~8GBYes
Llama 3.3 13BFP16~26GBYes
Qwen3 14BFP16~28GBTight - limit context
Qwen3 32BQ4/AWQ~20GBYes
Llama 3.3 70BINT4~35-40GBNo
FLUX.1 DevBF16~26GBYes
SDXLFP16~8-12GBYes

For inference throughput, bandwidth drives token generation rate more than CUDA core count. The memory-bound decode phase loads model weights repeatedly for each token; 1,792 GB/s means you complete each load cycle faster. CUDA cores handle the compute phase, which is typically not the bottleneck on sub-13B models.

See GPU memory requirements for LLMs for the full model capacity table across all GPU tiers.

The RTX 5090 has no NVLink connector. Multi-GPU tensor parallelism (used to serve 70B+ models across multiple GPUs) is not practical on RTX 5090 instances. For 70B+ multi-GPU serving, use H100 SXM or B200.

Where to Rent RTX 5090: Spheron Pricing

Spheron is an on-demand and spot GPU marketplace aggregating compute from 5+ providers into a single platform with per-minute billing. RTX 5090 instances are available now, no quota requests needed, and typically deploy in under 2 minutes.

ConfigurationTypePrice
RTX 5090 (1x GPU)On-demand$0.86/hr

Pricing fluctuates based on GPU availability. The prices above are based on 20 May 2026 and may have changed. Check current GPU pricing for live rates.

You can rent RTX 5090 GPU instances on Spheron with per-minute billing and no minimum commitment.

Quick start with vLLM serving Llama 3.1 8B:

bash
# Serve Llama 3.1 8B on RTX 5090 with vLLM
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

See the Spheron getting-started guide for SSH setup, instance types, and storage configuration.


The RTX 5090's 1,792 GB/s bandwidth and 32GB GDDR7 make it the highest-performance consumer GPU available for LLM inference today. Deploy a bare-metal instance on Spheron in under 2 minutes.

Rent RTX 5090 → | View all GPU pricing → | Spheron docs →

FAQ / 06

Frequently Asked Questions

The RTX 5090 has 21,760 CUDA cores across 170 streaming multiprocessors (SMs) on the GB202 die. This is 33% more CUDA cores than the RTX 4090's 16,384. Each SM includes 4th-gen RT Cores and 5th-gen Tensor Cores with native FP4 support.

The RTX 5090 has 32GB of GDDR7 memory on a 512-bit bus, delivering 1,792 GB/s of bandwidth. This is 8GB more than the RTX 4090's 24GB GDDR6X and fits 8B models at FP16 (~16GB), 13B models at FP16 (~26GB), and 32B models at Q4/AWQ (~20GB).

It depends on workload scale. The RTX 5090 is better for cost-sensitive inference on models up to 32B parameters - it delivers approximately $0.060-0.068 per million tokens for Llama 3.1 8B FP16, cheaper than the H100 PCIe. The H100 wins for 70B+ models (80GB HBM2e vs 32GB GDDR7), ECC memory requirements, multi-GPU NVLink tensor parallelism, and enterprise SLA workloads. For development and sub-30B inference, the RTX 5090 is the better value.

Yes. The RTX 5090 uses the Blackwell GB202 die and has native FP4 support via 5th-generation Tensor Cores. At FP4 sparse, it delivers 3,352 AI TOPS. FP4 quantization support in vLLM is in active development; check vLLM release notes for stable FP4 availability before using it in production.

The RTX 5090 has a 575W TDP. It uses a 16-pin (12V-2x6) power connector. You need a PSU with at least 850W total capacity (1000W recommended for a full system under load). It also requires PCIe 5.0 x16 for full bandwidth, though it will work in a PCIe 4.0 slot at reduced transfer bandwidth.

Comfortably: Llama 3.1 8B at FP16 (~16GB), Mistral 7B at FP16 (~14GB), Llama 3.3 13B at FP16 (~26GB). With quantization: Qwen3 32B at Q4/AWQ (~20GB fits with room for KV cache). Does not fit: Llama 3.3 70B at any quantization (~35-40GB at INT4), Llama 4 Scout 109B MoE (~55GB at INT4).

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.