
NVIDIA RTX PRO 6000 Blackwell for AI: 96GB GDDR7, Benchmarks, and Cloud Pricing

Written by Mitrasish, Co-founder · Mar 9, 2026
GPU Cloud · NVIDIA RTX PRO 6000 · Blackwell GPU · AI Inference · GPU Rental · LLM · 30B Models

The NVIDIA RTX PRO 6000 Blackwell has 96GB of GDDR7 memory on a single card. That is enough for 30B AWQ models with vast headroom, 32B models in both FP8 and FP16, and 70B models in Q4 and FP8 quantization. On Spheron it runs at $1.65/hr on-demand or $0.72/hr on spot, both cheaper than the H100 PCIe at $2.01/hr on-demand, while offering 16GB more VRAM (96GB vs 80GB).

CloudRift published benchmarks showing a single RTX PRO 6000 delivering approximately 8,400 tokens per second on a 30B AWQ model, nearly matching a 4x RTX 4090 setup at 8,900 tokens per second. One GPU versus four, with near-identical throughput and lower total power draw.

This is a workstation-class GPU, not a datacenter GPU. It runs on PCIe, has GDDR7 rather than HBM, and lacks NVLink. For 30B through 70B FP8 inference workloads at low-to-medium concurrency, those tradeoffs are largely irrelevant. Here is what it actually delivers in practice.

RTX PRO 6000 Specs: What Matters for AI

The headline numbers first, then the comparisons that matter.

| Spec | RTX PRO 6000 | RTX 5090 | H100 PCIe | A100 40GB SXM |
|---|---|---|---|---|
| VRAM | 96GB GDDR7 | 32GB GDDR7 | 80GB HBM2e | 40GB HBM2 |
| Memory bandwidth | 1.792 TB/s | 1.792 TB/s | 2.0 TB/s | 1.555 TB/s |
| Architecture | Blackwell | Blackwell | Hopper | Ampere |
| FP4 support | Yes | Yes | No | No |
| FP8 support | Yes | Yes | Yes | No |
| NVLink | No | No | No (PCIe) | Yes (NVLink 3.0 on SXM) |
| Form factor | Workstation PCIe | Consumer PCIe | PCIe / SXM | SXM |
| ECC memory | Yes | No | Yes | Yes |

The RTX 5090 vs RTX PRO 6000 comparison clarifies where each GPU fits.

Both use the Blackwell architecture with GDDR7 memory, and both deliver 1.792 TB/s of bandwidth. The difference is capacity: the PRO 6000's 96GB is three times the RTX 5090's 32GB. That extra 64GB is decisive for large models: 70B FP8 (~70GB) and 32B FP16 (~64GB) fit in the PRO 6000's 96GB but not in the RTX 5090's 32GB. The PRO 6000 also includes ECC memory for production reliability, which the consumer RTX 5090 lacks.

The RTX PRO 6000 vs H100 PCIe comparison is equally revealing.

The PRO 6000 has more VRAM (96GB vs 80GB), enabling it to run models that exceed the H100 PCIe's capacity, such as large batches against 70B FP8 or any model requiring between 80GB and 96GB. The H100 PCIe uses HBM2e with 2.0 TB/s bandwidth, higher than the PRO 6000 Workstation Edition's 1.792 TB/s (an ~11.6% difference). At $1.65/hr on-demand (or $0.72/hr spot) vs $2.01/hr for the H100 PCIe on-demand, the PRO 6000 is cheaper with more VRAM. Its additional advantages are Blackwell FP4 support and higher per-dollar throughput on 30B AWQ inference workloads.

The RTX PRO 6000 supports FP4 precision, the same Blackwell-generation capability found in the B200 and RTX 5090, which the H100 lacks entirely. For memory-bound inference, FP4 can roughly double peak throughput relative to FP8. For more on how Blackwell architecture advances compare to prior generations, see our NVIDIA B300 Blackwell Ultra guide.

What Models Fit in 96GB VRAM

96GB covers the full range of production inference workloads, from small 8B models to 70B FP8 models on a single GPU. Here is the breakdown:

| Model | Size | Precision | Fits on RTX PRO 6000? | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | FP16 | Yes | ~16GB (roughly 17% of VRAM) |
| Qwen2.5 14B | 14B | FP16 | Yes | ~28GB (fits comfortably, ~68GB headroom) |
| Qwen2.5 32B | 32B | FP8 | Yes | ~32GB (~64GB headroom for KV cache) |
| Qwen2.5 32B | 32B | FP16 | Yes | ~64GB (~32GB headroom for KV cache) |
| Llama 3.3 70B | 70B | Q4 (AWQ) | Yes | ~35-40GB weights; ~56-61GB KV cache headroom |
| Llama 3.3 70B | 70B | FP8 | Yes | ~70GB weights; ~26GB headroom, moderate batch sizes |
| Llama 3.3 70B | 70B | FP16 | No | ~140GB needed (far exceeds 96GB) |
| Mixtral 8x7B | ~47B total | Q4 | Yes | ~24GB (very comfortable fit) |
| Mixtral 8x7B | ~47B total | FP16 | Technically, but impractical | ~94GB weights fit, leaving only ~2GB for KV cache; unusable for real inference |
| DeepSeek-R1-Distill-Llama-70B | 70B | Q4 | Yes | ~35-40GB; ~56-61GB KV cache headroom |
| DeepSeek-R1-Distill-Llama-70B | 70B | FP8 | Yes | ~70GB; ~26GB KV headroom, moderate batch sizes |
| SDXL | N/A | BF16 | Yes | ~10GB (trivial) |
| Flux.1 Dev | N/A | BF16 | Yes | ~23GB (plenty of headroom) |

The defining capability: 96GB enables single-GPU inference across a wider range than any consumer or most workstation GPUs. 70B FP8 models fit with ~26GB remaining for KV cache, sufficient for moderate batch sizes and standard context lengths. For 70B FP16 (~140GB) or very large models requiring more than 96GB in any precision, you need an H200 (141GB) or a multi-GPU setup. For a complete VRAM sizing reference across all major 2026 models, see the GPU requirements cheat sheet 2026.
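
The weight footprints above follow from simple arithmetic: parameter count times bytes per parameter, plus runtime overhead. Here is a back-of-envelope sketch you can adapt; the 5% overhead factor and the ~0.55 bytes-per-parameter figure for Q4/AWQ (4-bit weights plus quantization scales) are rough assumptions, and real footprints vary by runtime and activation buffers:

```bash
# Rough weight footprint: params (billions) x bytes per param, plus ~5% overhead.
# Bytes per param: FP16/BF16 = 2, FP8 = 1, Q4/AWQ ~ 0.55 (4-bit weights + scales).
estimate_vram() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.0f GB\n", p * b * 1.05 }'
}

estimate_vram 70 2     # Llama 3.3 70B FP16 -> ~147 GB (exceeds 96GB)
estimate_vram 70 1     # Llama 3.3 70B FP8  -> ~74 GB (fits)
estimate_vram 70 0.55  # Llama 3.3 70B Q4   -> ~40 GB (fits comfortably)
estimate_vram 32 2     # Qwen2.5 32B FP16   -> ~67 GB (fits)
```

Whatever the estimate leaves free becomes the KV cache budget, which is what actually determines batch size and context length in serving.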

Real Benchmark Data

The RTX PRO 6000 numbers come from CloudRift's published report, RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark (October 2025), and reflect that report's testing methodology. The H100 PCIe 70B FP8 figure is a representative estimate based on published vLLM community benchmarks from 2025 at comparable batch sizes; actual throughput varies with vLLM version, batch size, context length, and hardware configuration.

30B AWQ inference throughput (CloudRift, Oct 2025):

| GPU | Model | Precision | Tokens/sec | Source |
|---|---|---|---|---|
| RTX PRO 6000 | Qwen3-Coder-30B-A3B-Instruct-AWQ | AWQ | ~8,400 | CloudRift (Oct 2025) |
| 4x RTX 4090 | Qwen3-Coder-30B-A3B-Instruct-AWQ | AWQ | ~8,900 | CloudRift (Oct 2025) |

70B FP8 inference throughput (a heavier workload, not directly comparable to the 30B AWQ results above):

| GPU | Model | Precision | Tokens/sec | Source |
|---|---|---|---|---|
| H100 PCIe | 70B FP8 model | FP8 | ~2,800 | vLLM benchmarks (2025, est.) |

The near-parity between 1x RTX PRO 6000 and 4x RTX 4090 on 30B models is the key result. The CloudRift test used Qwen3-Coder-30B-A3B-Instruct-AWQ (a sparse MoE model with approximately 30.5B total parameters and approximately 3.3B active parameters per forward pass) at 400 concurrent requests with 1,000 input and 1,000 output tokens. Because this is a sparse MoE model, only a fraction of parameters are active per forward pass, meaning the actual compute load resembles a much smaller dense model while the full parameter set is stored in VRAM in AWQ quantization. The 30B AWQ model weights occupy approximately 24GB within the PRO 6000's 96GB, leaving approximately 72GB of headroom for KV cache even at high concurrency. The 4x RTX 4090 setup runs the same model distributed across four GPUs with PCIe communication overhead between cards. You get near-identical throughput from simpler infrastructure at lower power consumption and without inter-GPU communication latency.

For context on how these results fit into the broader GPU benchmark landscape, see our GPU cloud benchmarks 2026 comparison.

Cost Per Million Tokens vs Alternatives

The financial case for the RTX PRO 6000 comes from running 30B models on one card instead of paying $2.01/hr for an H100 PCIe with lower throughput on those workloads. Spot pricing makes the per-token cost even more compelling.

30B model inference cost comparison (prices as of March 10, 2026):

| GPU | Configuration | $/hr on Spheron | Est. tokens/sec (30B AWQ) | Est. cost per 1M tokens |
|---|---|---|---|---|
| RTX PRO 6000 | 1x card (spot) | $0.72 | ~8,400 | ~$0.02 |
| RTX PRO 6000 | 1x card (on-demand) | $1.65 | ~8,400 | ~$0.05 |
| 4x RTX 4090 | 4x cards (on-demand) | $2.24 | ~8,900 | ~$0.07 |

On spot, a single RTX PRO 6000 at $0.72/hr delivers 30B AWQ inference at approximately $0.02 per million tokens. On-demand at $1.65/hr brings that to roughly $0.05 per million tokens, still cheaper than four RTX 4090s at $2.24/hr combined for near-identical throughput.
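
The per-million-token figures reduce to hourly price divided by tokens generated per hour. A quick sketch to reproduce them; the throughput inputs are the CloudRift estimates above, and your sustained numbers will differ:

```bash
# Cost per 1M tokens = ($/hr) / (tokens/sec x 3600 sec/hr) x 1,000,000
cost_per_million() {
  awk -v rate="$1" -v tps="$2" \
    'BEGIN { printf "$%.3f per 1M tokens\n", rate / (tps * 3600) * 1e6 }'
}

cost_per_million 0.72 8400   # RTX PRO 6000 spot      -> ~$0.024
cost_per_million 1.65 8400   # RTX PRO 6000 on-demand -> ~$0.055
cost_per_million 2.24 8900   # 4x RTX 4090 on-demand  -> ~$0.070
```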

For 70B FP8 inference on a single card, the RTX PRO 6000 fits the model (~70GB) with approximately 26GB remaining for KV cache. The H100 PCIe (80GB) fits 70B FP8 as well (~10GB headroom), but at $2.01/hr on-demand versus $1.65/hr and with less VRAM headroom for context. For maximum 70B FP8 serving throughput, the H100 PCIe's higher memory bandwidth (2.0 TB/s vs 1.792 TB/s) provides an edge at very high concurrency.

For a complete cost analysis framework, see our GPU cost optimization playbook.

Best Use Cases for RTX PRO 6000

30B model inference at high throughput

The CloudRift benchmark result of 8,400 tokens/sec on a 30B AWQ model shows the PRO 6000 competing directly with a 4x RTX 4090 setup. For 30B inference at scale, one PRO 6000 card is simpler and significantly cheaper than a four-card configuration. vLLM with --quantization awq handles AWQ model loading, and --quantization fp8 with --dtype auto handles FP8 for 30B models well within the 96GB budget.

32B models in FP8 and FP16 precision on a single GPU

With 96GB of GDDR7, the RTX PRO 6000 runs 32B FP8 models (~32GB weights) on a single card with approximately 64GB remaining for KV cache, and 32B FP16 models (~64GB weights) with approximately 32GB remaining for KV cache. No tensor parallelism, no multi-GPU orchestration. For teams running Qwen2.5 32B or similar models in FP8 or FP16 format, the PRO 6000 provides a single-GPU path at $1.65/hr on-demand or $0.72/hr spot.

70B models in Q4 and FP8 quantization

70B models in Q4/AWQ quantization (~35-40GB) fit on a single RTX PRO 6000 with approximately 56-61GB remaining for KV cache. 70B FP8 models (~70GB) also fit, with approximately 26GB remaining for KV cache. This is sufficient for single-user to moderate-batch inference at standard context lengths. For production 70B FP8 serving at maximum concurrency, the H100 PCIe's higher memory bandwidth (2.0 TB/s) gives it an advantage as concurrent requests scale.
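
To see what ~26GB of KV headroom buys, estimate the KV cache cost per token: 2 (K and V) times layers times KV heads times head dimension times bytes per element. A sketch using Llama 3.3 70B's published GQA configuration (80 layers, 8 KV heads, head dim 128; verify against the model card, and note that vLLM's paged allocation adds some overhead):

```bash
# KV cache per token (bytes) = 2 (K+V) * layers * kv_heads * head_dim * bytes/element
kv_per_token() {
  awk -v l="$1" -v h="$2" -v d="$3" -v b="$4" \
    'BEGIN { printf "%.2f MB/token\n", 2 * l * h * d * b / 1048576 }'
}

kv_per_token 80 8 128 2   # Llama 3.3 70B, FP16 KV cache -> ~0.31 MB/token
# ~26GB of headroom / ~0.31 MB/token ~ 85,000 tokens of total context,
# shared across all concurrent requests (e.g., ~20 requests at 4k context).
```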

Development and testing for large models

Developing against 70B FP8, 70B Q4, and 32B FP16 models on a single card at $1.65/hr on-demand (or $0.72/hr on spot for non-time-sensitive iterations) makes large model development accessible without committing to multi-GPU reserved pricing. The 96GB is sufficient for iterative testing of production inference pipelines through 70B FP8.

Diffusion model pipelines with heavy ControlNet and LoRA stacking

96GB lets you load diffusion model stacks including base model, multiple ControlNets, and LoRA adapters simultaneously with extensive headroom. SDXL with several active adapters uses roughly 25-30GB of the available 96GB.

LoRA fine-tuning of smaller models

The PRO 6000's 96GB of GDDR7 and Blackwell Tensor Cores make it well suited to LoRA and QLoRA fine-tuning of models up to 30B-40B parameters, where gradient checkpointing keeps peak VRAM within bounds.

When NOT to Use RTX PRO 6000

70B models in FP16 precision

96GB is not enough for Llama 3.3 70B in FP16 (~140GB needed). For full-precision FP16 inference on 70B models, you need an H200 (141GB) or a multi-GPU setup.

Models requiring more than 96GB in any precision

Any model requiring more than 96GB of VRAM cannot run on a single RTX PRO 6000. This includes 70B FP16 (~140GB) and all larger models. These workloads require H200, B200, or multi-GPU infrastructure.

Mixtral 8x7B in FP16 precision

Mixtral 8x7B FP16 weights occupy approximately 94GB, which technically fits within 96GB. However, this leaves virtually no headroom for KV cache, making real inference workloads impractical: at Mixtral's roughly 0.13 MB of FP16 KV cache per token (32 layers, 8 KV heads, head dim 128), ~2GB of headroom holds only about 16k tokens of context across all requests. Use Mixtral 8x7B in Q4 format on the RTX PRO 6000 instead (a comfortable fit at ~24GB, leaving ~72GB for KV cache) or use an H200 for full FP16 Mixtral inference.

Multi-GPU tensor parallelism

No NVLink means no high-bandwidth inter-GPU connectivity. If you need to scale beyond 96GB with tensor parallelism, you need H100 SXM with NVLink. Additional PRO 6000 cards connected over PCIe carry significant communication overhead.

Maximum inference throughput at high batch sizes

H100 SXM's HBM3 delivers 3.35 TB/s of memory bandwidth, significantly higher than the RTX PRO 6000's GDDR7 at 1.792 TB/s. The H100 PCIe at 2.0 TB/s is faster than the PRO 6000 Workstation Edition (~11.6% difference), but the H100 SXM's bandwidth advantage is substantial. The PRO 6000 is competitive at low-to-medium batch sizes but falls behind the H100 SXM as concurrent users scale toward maximum throughput.

Enterprise datacenter workloads at scale

This is a workstation GPU. It lacks MIG partitioning for multi-tenant serving and has no NVLink for cluster integration. For long-running production inference at scale with SLA requirements, H100 is the proven path. See our production GPU cloud architecture guide for datacenter-grade deployment patterns.

RTX PRO 6000 vs H100 PCIe: Direct Decision Guide

When you are choosing between the two most common options in this price range:

| Situation | Choose RTX PRO 6000 | Choose H100 PCIe |
|---|---|---|
| Running 30B models at high throughput | Yes (~8,400 tokens/sec at $1.65/hr on-demand, $0.72/hr spot) | Possible, lower throughput at $2.01/hr on-demand |
| Running 70B Q4 at moderate concurrency | Yes (fits in 96GB, ~56-61GB KV cache) | Yes (fits in 80GB, ~40-45GB KV cache) |
| Running 70B FP8 on a single GPU | Yes (~70GB fits in 96GB, ~26GB headroom) | Yes (~70GB fits in 80GB, ~10GB headroom; higher bandwidth) |
| Running 32B FP16 on a single GPU | Yes (~64GB fits in 96GB, ~32GB headroom) | Yes (~64GB fits in 80GB, ~16GB headroom) |
| Running 32B FP8 on a single GPU | Yes (~32GB fits in 96GB, ~64GB headroom) | Yes (~32GB fits in 80GB, ~48GB headroom) |
| More VRAM capacity | Yes (96GB vs 80GB) | No (80GB vs 96GB) |
| More affordable on-demand | Yes ($1.65/hr) | No ($2.01/hr) |
| Spot pricing available | Yes ($0.72/hr) | No |
| Maximum memory bandwidth | No (GDDR7 at 1.792 TB/s) | Yes (HBM2e at 2.0 TB/s) |
| FP4 inference support | Yes (Blackwell FP4) | No (Hopper tops out at FP8) |
| Multi-GPU training at scale | No (no NVLink) | No (PCIe variant lacks NVLink; need H100 SXM) |
| ECC memory for production | Yes (professional grade) | Yes (datacenter grade) |

The H100 PCIe's bandwidth (2.0 TB/s vs 1.792 TB/s for the PRO 6000 Workstation Edition) represents an ~11.6% difference, which may matter at very high concurrency but is not typically decisive for 30B workloads. The RTX PRO 6000 wins on VRAM capacity (96GB vs 80GB), cost-per-token for 30B models, Blackwell FP4 support, and price ($1.65/hr on-demand or $0.72/hr spot vs $2.01/hr on-demand). For a detailed head-to-head comparison of H100 and H200, see our H100 vs H200 guide.

How to Deploy on Spheron

Deploying the RTX PRO 6000 on Spheron takes a few minutes:

  1. Go to app.spheron.ai and sign up with GitHub or Gmail
  2. Add credits using card or crypto, then click Deploy in the left-hand menu
  3. Select RTX PRO 6000 from the GPU catalog
  4. Choose your Ubuntu version, storage, and region, then click Deploy Instance

For 32B class models in FP8 (example: Qwen2.5-32B-Instruct), deploy with vLLM:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

For 30B AWQ inference:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

For 70B Q4 quantization at moderate concurrency:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model casperhansen/llama-3.3-70b-instruct-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
```

With 96GB of VRAM, 70B Q4 weights (~35-40GB) leave ample headroom for KV cache. The 0.85 GPU memory utilization setting caps vLLM at roughly 81.6GB of the 96GB, keeping about 14GB free as a safety buffer while still leaving 40GB+ beyond the weight footprint for KV cache.
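
Once any of these containers is up, vLLM exposes an OpenAI-compatible API on port 8000. A quick smoke test; the model field must match the --model value you deployed:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```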

Monitor VRAM with:

```bash
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
```

Pricing on Spheron

Pricing as of March 10, 2026 from live Spheron GPU marketplace data. Check www.spheron.network/pricing for current rates.

RTX PRO 6000 pricing:

| Configuration | $/hr | Notes |
|---|---|---|
| RTX PRO 6000 on-demand (dedicated) | $1.65 | Always available, fixed rate |
| RTX PRO 6000 spot | $0.72 | May be interrupted |

On-demand instances are always available at a fixed hourly rate. Spot instances run at a deep discount but can be interrupted when provider capacity is reclaimed, which makes them suitable for batch jobs, fine-tuning, and other non-time-sensitive workloads; a minimal interruption-handling pattern is sketched below.
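
A minimal pattern for interruption-tolerant spot jobs, assuming your workload checkpoints periodically and can resume from its latest checkpoint. Whether the instance receives a SIGTERM grace period before reclamation is an assumption here (check Spheron's docs), and train.py with --resume-from is a hypothetical placeholder for your own job:

```bash
#!/usr/bin/env bash
# Resumable spot-job wrapper: forward SIGTERM to the job so it can
# checkpoint and exit cleanly, then resume on the next run.
CKPT_DIR=/workspace/checkpoints   # hypothetical path; point at your checkpoint dir

run_job() {
  python train.py --resume-from "$CKPT_DIR" &   # placeholder for your checkpointing job
  local job_pid=$!
  trap 'kill -TERM "$job_pid"' TERM             # pass the interruption through
  wait "$job_pid"
}

until run_job; do
  echo "Job exited non-zero; resuming from latest checkpoint in 30s..."
  sleep 30
done
```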

For context, here is how the RTX PRO 6000 compares to alternatives across both pricing tiers:

| GPU | On-demand $/hr | Spot $/hr | VRAM | Fits 70B FP8? | Fits 70B Q4? |
|---|---|---|---|---|---|
| RTX 5090 PCIe | $0.76 | — | 32GB | No (70GB exceeds 32GB) | No (35-40GB exceeds 32GB) |
| RTX PRO 6000 | $1.65 | $0.72 | 96GB | Yes (~26GB KV headroom) | Yes (comfortable, ~56-61GB KV headroom) |
| H100 PCIe | $2.01 | — | 80GB | Yes (~10GB headroom) | Yes (comfortable) |
| H200 SXM5 | $4.23 | $1.43 | 141GB | Yes (extensive headroom) | Yes (extensive headroom) |
| 2x H100 PCIe | $4.03 | — | 2x 80GB | Yes (with tensor parallelism) | Yes (with tensor parallelism) |

The RTX PRO 6000 occupies a clear position: more VRAM than both the RTX 5090 (96GB vs 32GB) and the H100 PCIe (96GB vs 80GB), Blackwell FP4 support, ECC memory, and lower price than the H100 PCIe ($1.65/hr vs $2.01/hr on-demand, $0.72/hr spot). The H100 PCIe's advantage is higher memory bandwidth (2.0 TB/s vs 1.792 TB/s) for throughput-bound workloads at very high concurrency, though the ~11.6% difference may not be decisive for 30B workloads. For 30B through 70B Q4 inference with maximum single-GPU VRAM at the lowest cost, the PRO 6000 leads.


RTX PRO 6000 instances on Spheron give you 96GB of Blackwell GDDR7 on a single card, with FP4 support for next-generation inference efficiency. Available now, no waitlist.

Rent an RTX PRO 6000
