
NVIDIA RTX PRO 6000 Blackwell for AI: 96GB GDDR7, Benchmarks, and Cloud Pricing

Written by Mitrasish, Co-founder · Mar 9, 2026
GPU Cloud · NVIDIA RTX PRO 6000 · Blackwell GPU · AI Inference · GPU Rental · LLM · 30B Models

The NVIDIA RTX PRO 6000 Blackwell has 96GB of GDDR7 memory on a single card. That is enough for 30B AWQ models with vast headroom, 32B models in both FP8 and FP16, and 70B models in Q4 and FP8 quantization. On Spheron it runs at $1.65/hr on-demand or $0.72/hr on spot, both cheaper than the H100 PCIe at $2.01/hr on-demand, while offering 16GB more VRAM (96GB vs 80GB).

CloudRift published benchmarks showing a single RTX PRO 6000 delivering approximately 8,400 tokens per second on a 30B AWQ model, nearly matching a 4x RTX 4090 setup at 8,900 tokens per second. One GPU versus four, with near-identical throughput and lower total power draw.

This is a workstation-class GPU, not a datacenter GPU. It runs on PCIe, has GDDR7 rather than HBM, and lacks NVLink. For 30B through 70B FP8 inference workloads at low-to-medium concurrency, those tradeoffs are largely irrelevant. Here is what it actually delivers in practice.

RTX PRO 6000 Specs: What Matters for AI

The headline numbers first, then the comparisons that matter.

| Spec | RTX PRO 6000 | RTX 5090 | H100 PCIe | A100 40GB SXM |
|---|---|---|---|---|
| VRAM | 96GB GDDR7 | 32GB GDDR7 | 80GB HBM2e | 40GB HBM2 |
| Memory bandwidth | 1.792 TB/s | 1.792 TB/s | 2.0 TB/s | 1.555 TB/s |
| Architecture | Blackwell | Blackwell | Hopper | Ampere |
| FP4 support | Yes | Yes | No | No |
| FP8 support | Yes | Yes | Yes | No |
| NVLink | No | No | No (PCIe) | Yes (NVLink 3.0 on SXM) |
| Form factor | Workstation PCIe | Consumer PCIe | PCIe / SXM | SXM |
| ECC memory | Yes | No | Yes | Yes |

The RTX 5090 vs RTX PRO 6000 comparison clarifies where each GPU fits.

Both use the Blackwell architecture with GDDR7 memory, and both deliver 1.792 TB/s of bandwidth. The difference is capacity: the PRO 6000's 96GB is three times the RTX 5090's 32GB. That extra 64GB is decisive for large models: 70B FP8 (~70GB) and 32B FP16 (~64GB) fit in the PRO 6000's 96GB but not in the RTX 5090's 32GB. The PRO 6000 also includes ECC memory for production reliability, which the consumer RTX 5090 lacks.

The RTX PRO 6000 vs H100 PCIe comparison is equally revealing.

The PRO 6000 has more VRAM (96GB vs 80GB), enabling it to run models that exceed the H100 PCIe's capacity, such as large batches against 70B FP8 or any model requiring between 80GB and 96GB. The H100 PCIe uses HBM2e with 2.0 TB/s bandwidth, higher than the PRO 6000 Workstation Edition's 1.792 TB/s (an ~11.6% difference). At $1.65/hr on-demand (or $0.72/hr spot) vs $2.01/hr for the H100 PCIe on-demand, the PRO 6000 is cheaper with more VRAM. Its additional advantages are Blackwell FP4 support and higher per-dollar throughput on 30B AWQ inference workloads.

The RTX PRO 6000 supports FP4 precision, the same Blackwell-generation capability found in the B200 and RTX 5090, which the H100 lacks entirely. For memory-bound inference, FP4 can roughly double peak throughput relative to FP8. For more on how Blackwell architecture advances compare to prior generations, see our NVIDIA B300 Blackwell Ultra guide.

What Models Fit in 96GB VRAM

96GB covers the full range of production inference workloads, from small 8B models to 70B FP8 models on a single GPU. Here is the breakdown:

| Model | Size | Precision | Fits on RTX PRO 6000? | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | FP16 | Yes | ~16GB (roughly 17% of VRAM) |
| Qwen2.5 14B | 14B | FP16 | Yes | ~28GB (fits comfortably, ~68GB headroom) |
| Qwen2.5 32B | 32B | FP8 | Yes | ~32GB (~64GB headroom for KV cache) |
| Qwen2.5 32B | 32B | FP16 | Yes | ~64GB (~32GB headroom for KV cache) |
| Llama 3.3 70B | 70B | Q4 (AWQ) | Yes | ~35-40GB weights; ~56-61GB KV cache headroom |
| Llama 3.3 70B | 70B | FP8 | Yes | ~70GB weights; ~26GB headroom, moderate batch sizes |
| Llama 3.3 70B | 70B | FP16 | No | ~140GB needed (far exceeds 96GB) |
| Mixtral 8x7B | ~47B total | Q4 | Yes | ~24GB (very comfortable fit) |
| Mixtral 8x7B | ~47B total | FP16 | Technically, but impractical | ~94GB weights fit, leaving only ~2GB for KV cache; unusable for real inference |
| DeepSeek-R1-Distill-Llama-70B | 70B | Q4 | Yes | ~35-40GB; ~56-61GB KV cache headroom |
| DeepSeek-R1-Distill-Llama-70B | 70B | FP8 | Yes | ~70GB; ~26GB KV headroom, moderate batch sizes |
| SDXL | N/A | BF16 | Yes | ~10GB (trivial) |
| Flux.1 Dev | N/A | BF16 | Yes | ~23GB (plenty of headroom) |

The defining capability: 96GB enables single-GPU inference across a wider range than any consumer or most workstation GPUs. 70B FP8 models fit with ~26GB remaining for KV cache, sufficient for moderate batch sizes and standard context lengths. For 70B FP16 (~140GB) or very large models requiring more than 96GB in any precision, you need an H200 (141GB) or a multi-GPU setup. For a complete VRAM sizing reference across all major 2026 models, see the GPU requirements cheat sheet 2026.
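
The weight footprints above follow from simple arithmetic: parameter count times bytes per parameter, plus runtime overhead. Here is a back-of-envelope sketch you can adapt; the 5% overhead factor and the ~0.55 bytes-per-parameter figure for Q4/AWQ (4-bit weights plus quantization scales) are rough assumptions, and real footprints vary by runtime and activation buffers:

```bash
# Rough weight footprint: params (billions) x bytes per param, plus ~5% overhead.
# Bytes per param: FP16/BF16 = 2, FP8 = 1, Q4/AWQ ~ 0.55 (4-bit weights + scales).
estimate_vram() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.0f GB\n", p * b * 1.05 }'
}

estimate_vram 70 2     # Llama 3.3 70B FP16 -> ~147 GB (exceeds 96GB)
estimate_vram 70 1     # Llama 3.3 70B FP8  -> ~74 GB (fits)
estimate_vram 70 0.55  # Llama 3.3 70B Q4   -> ~40 GB (fits comfortably)
estimate_vram 32 2     # Qwen2.5 32B FP16   -> ~67 GB (fits)
```

Whatever the estimate leaves free becomes the KV cache budget, which is what actually determines batch size and context length in serving.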

Real Benchmark Data

The RTX PRO 6000 numbers come from CloudRift's published report, RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark (October 2025), and reflect that report's testing methodology. The H100 PCIe 70B FP8 figure is a representative estimate based on published vLLM community benchmarks from 2025 at comparable batch sizes; actual throughput varies with vLLM version, batch size, context length, and hardware configuration.

30B AWQ inference throughput (CloudRift, Oct 2025):

| GPU | Model | Precision | Tokens/sec | Source |
|---|---|---|---|---|
| RTX PRO 6000 | Qwen3-Coder-30B-A3B-Instruct-AWQ | AWQ | ~8,400 | CloudRift (Oct 2025) |
| 4x RTX 4090 | Qwen3-Coder-30B-A3B-Instruct-AWQ | AWQ | ~8,900 | CloudRift (Oct 2025) |

70B FP8 inference throughput (a heavier workload, not directly comparable to the 30B AWQ results above):

| GPU | Model | Precision | Tokens/sec | Source |
|---|---|---|---|---|
| H100 PCIe | 70B FP8 model | FP8 | ~2,800 | vLLM benchmarks (2025, est.) |

The near-parity between 1x RTX PRO 6000 and 4x RTX 4090 on 30B models is the key result. The CloudRift test used Qwen3-Coder-30B-A3B-Instruct-AWQ (a sparse MoE model with approximately 30.5B total parameters and approximately 3.3B active parameters per forward pass) at 400 concurrent requests with 1,000 input and 1,000 output tokens. Because this is a sparse MoE model, only a fraction of parameters are active per forward pass, meaning the actual compute load resembles a much smaller dense model while the full parameter set is stored in VRAM in AWQ quantization. The 30B AWQ model weights occupy approximately 24GB within the PRO 6000's 96GB, leaving approximately 72GB of headroom for KV cache even at high concurrency. The 4x RTX 4090 setup runs the same model distributed across four GPUs with PCIe communication overhead between cards. You get near-identical throughput from simpler infrastructure at lower power consumption and without inter-GPU communication latency.

For context on how these results fit into the broader GPU benchmark landscape, see our GPU cloud benchmarks 2026 comparison.

Cost Per Million Tokens vs Alternatives

The financial case for the RTX PRO 6000 comes from running 30B models on one card instead of paying $2.01/hr for an H100 PCIe with lower throughput on those workloads. Spot pricing makes the per-token cost even more compelling.

30B model inference cost comparison (prices as of March 10, 2026):

| GPU | Configuration | $/hr on Spheron | Est. tokens/sec (30B AWQ) | Est. cost per 1M tokens |
|---|---|---|---|---|
| RTX PRO 6000 | 1x card (spot) | $0.72 | ~8,400 | ~$0.02 |
| RTX PRO 6000 | 1x card (on-demand) | $1.65 | ~8,400 | ~$0.05 |
| 4x RTX 4090 | 4x cards (on-demand) | $2.24 | ~8,900 | ~$0.07 |

On spot, a single RTX PRO 6000 at $0.72/hr delivers 30B AWQ inference at approximately $0.02 per million tokens. On-demand at $1.65/hr brings that to roughly $0.05 per million tokens, still cheaper than four RTX 4090s at $2.24/hr combined for near-identical throughput.
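
The per-million-token figures reduce to hourly price divided by tokens generated per hour. A quick sketch to reproduce them; the throughput inputs are the CloudRift estimates above, and your sustained numbers will differ:

```bash
# Cost per 1M tokens = ($/hr) / (tokens/sec x 3600 sec/hr) x 1,000,000
cost_per_million() {
  awk -v rate="$1" -v tps="$2" \
    'BEGIN { printf "$%.3f per 1M tokens\n", rate / (tps * 3600) * 1e6 }'
}

cost_per_million 0.72 8400   # RTX PRO 6000 spot      -> ~$0.024
cost_per_million 1.65 8400   # RTX PRO 6000 on-demand -> ~$0.055
cost_per_million 2.24 8900   # 4x RTX 4090 on-demand  -> ~$0.070
```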

For 70B FP8 inference on a single card, the RTX PRO 6000 fits the model (~70GB) with approximately 26GB remaining for KV cache. The H100 PCIe (80GB) fits 70B FP8 as well (~10GB headroom), but at $2.01/hr on-demand versus $1.65/hr and with less VRAM headroom for context. For maximum 70B FP8 serving throughput, the H100 PCIe's higher memory bandwidth (2.0 TB/s vs 1.792 TB/s) provides an edge at very high concurrency.

For a complete cost analysis framework, see our GPU cost optimization playbook.

Best Use Cases for RTX PRO 6000

30B model inference at high throughput

The CloudRift benchmark result of 8,400 tokens/sec on a 30B AWQ model shows the PRO 6000 competing directly with a 4x RTX 4090 setup. For 30B inference at scale, one PRO 6000 card is simpler and significantly cheaper than a four-card configuration. vLLM with --quantization awq handles AWQ model loading, and --quantization fp8 with --dtype auto handles FP8 for 30B models well within the 96GB budget.

32B models in FP8 and FP16 precision on a single GPU

With 96GB of GDDR7, the RTX PRO 6000 runs 32B FP8 models (~32GB weights) on a single card with approximately 64GB remaining for KV cache, and 32B FP16 models (~64GB weights) with approximately 32GB remaining for KV cache. No tensor parallelism, no multi-GPU orchestration. For teams running Qwen2.5 32B or similar models in FP8 or FP16 format, the PRO 6000 provides a single-GPU path at $1.65/hr on-demand or $0.72/hr spot.

70B models in Q4 and FP8 quantization

70B models in Q4/AWQ quantization (~35-40GB) fit on a single RTX PRO 6000 with approximately 56-61GB remaining for KV cache. 70B FP8 models (~70GB) also fit, with approximately 26GB remaining for KV cache. This is sufficient for single-user to moderate-batch inference at standard context lengths. For production 70B FP8 serving at maximum concurrency, the H100 PCIe's higher memory bandwidth (2.0 TB/s) gives it an advantage as concurrent requests scale.
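
To see what ~26GB of KV headroom buys, estimate the KV cache cost per token: 2 (K and V) times layers times KV heads times head dimension times bytes per element. A sketch using Llama 3.3 70B's published GQA configuration (80 layers, 8 KV heads, head dim 128; verify against the model card, and note that vLLM's paged allocation adds some overhead):

```bash
# KV cache per token (bytes) = 2 (K+V) * layers * kv_heads * head_dim * bytes/element
kv_per_token() {
  awk -v l="$1" -v h="$2" -v d="$3" -v b="$4" \
    'BEGIN { printf "%.2f MB/token\n", 2 * l * h * d * b / 1048576 }'
}

kv_per_token 80 8 128 2   # Llama 3.3 70B, FP16 KV cache -> ~0.31 MB/token
# ~26GB of headroom / ~0.31 MB/token ~ 85,000 tokens of total context,
# shared across all concurrent requests (e.g., ~20 requests at 4k context).
```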

Development and testing for large models

Developing against 70B FP8, 70B Q4, and 32B FP16 models on a single card at $1.65/hr on-demand (or $0.72/hr on spot for non-time-sensitive iterations) makes large model development accessible without committing to multi-GPU reserved pricing. The 96GB is sufficient for iterative testing of production inference pipelines through 70B FP8.

Diffusion model pipelines with heavy ControlNet and LoRA stacking

96GB lets you load diffusion model stacks including base model, multiple ControlNets, and LoRA adapters simultaneously with extensive headroom. SDXL with several active adapters uses roughly 25-30GB of the available 96GB.

LoRA fine-tuning of smaller models

The PRO 6000's 96GB of GDDR7 and Blackwell Tensor Cores make it well suited to LoRA and QLoRA fine-tuning of models up to 30B-40B parameters, where gradient checkpointing keeps peak VRAM within bounds.

When NOT to Use RTX PRO 6000

70B models in FP16 precision

96GB is not enough for Llama 3.3 70B in FP16 (~140GB needed). For full-precision FP16 inference on 70B models, you need an H200 (141GB) or a multi-GPU setup.

Models requiring more than 96GB in any precision

Any model requiring more than 96GB of VRAM cannot run on a single RTX PRO 6000. This includes 70B FP16 (~140GB) and all larger models. These workloads require H200, B200, or multi-GPU infrastructure.

Mixtral 8x7B in FP16 precision

Mixtral 8x7B FP16 weights occupy approximately 94GB, which technically fits within 96GB. However, this leaves virtually no headroom for KV cache, making real inference workloads impractical: at Mixtral's roughly 0.13 MB of FP16 KV cache per token (32 layers, 8 KV heads, head dim 128), ~2GB of headroom holds only about 16k tokens of context across all requests. Use Mixtral 8x7B in Q4 format on the RTX PRO 6000 instead (a comfortable fit at ~24GB, leaving ~72GB for KV cache) or use an H200 for full FP16 Mixtral inference.

Multi-GPU tensor parallelism

No NVLink means no high-bandwidth inter-GPU connectivity. If you need to scale beyond 96GB with tensor parallelism, you need H100 SXM with NVLink. Additional PRO 6000 cards connected over PCIe carry significant communication overhead.

Maximum inference throughput at high batch sizes

H100 SXM's HBM3 delivers 3.35 TB/s of memory bandwidth, significantly higher than the RTX PRO 6000's GDDR7 at 1.792 TB/s. The H100 PCIe at 2.0 TB/s is faster than the PRO 6000 Workstation Edition (~11.6% difference), but the H100 SXM's bandwidth advantage is substantial. The PRO 6000 is competitive at low-to-medium batch sizes but falls behind the H100 SXM as concurrent users scale toward maximum throughput.

Enterprise datacenter workloads at scale

This is a workstation GPU. It lacks MIG partitioning for multi-tenant serving and has no NVLink for cluster integration. For long-running production inference at scale with SLA requirements, H100 is the proven path. See our production GPU cloud architecture guide for datacenter-grade deployment patterns.

RTX PRO 6000 vs H100 PCIe: Direct Decision Guide

When you are choosing between the two most common options in this price range:

| Situation | Choose RTX PRO 6000 | Choose H100 PCIe |
|---|---|---|
| Running 30B models at high throughput | Yes (~8,400 tokens/sec at $1.65/hr on-demand, $0.72/hr spot) | Possible, lower throughput at $2.01/hr on-demand |
| Running 70B Q4 at moderate concurrency | Yes (fits in 96GB, ~56-61GB KV cache) | Yes (fits in 80GB, ~40-45GB KV cache) |
| Running 70B FP8 on a single GPU | Yes (~70GB fits in 96GB, ~26GB headroom) | Yes (~70GB fits in 80GB, ~10GB headroom; higher bandwidth) |
| Running 32B FP16 on a single GPU | Yes (~64GB fits in 96GB, ~32GB headroom) | Yes (~64GB fits in 80GB, ~16GB headroom) |
| Running 32B FP8 on a single GPU | Yes (~32GB fits in 96GB, ~64GB headroom) | Yes (~32GB fits in 80GB, ~48GB headroom) |
| More VRAM capacity | Yes (96GB vs 80GB) | No (80GB vs 96GB) |
| More affordable on-demand | Yes ($1.65/hr) | No ($2.01/hr) |
| Spot pricing available | Yes ($0.72/hr) | No |
| Maximum memory bandwidth | No (GDDR7 at 1.792 TB/s) | Yes (HBM2e at 2.0 TB/s) |
| FP4 inference support | Yes (Blackwell FP4) | No (Hopper tops out at FP8) |
| Multi-GPU training at scale | No (no NVLink) | No (PCIe variant lacks NVLink; need H100 SXM) |
| ECC memory for production | Yes (professional grade) | Yes (datacenter grade) |

The H100 PCIe's bandwidth (2.0 TB/s vs 1.792 TB/s for the PRO 6000 Workstation Edition) represents an ~11.6% difference, which may matter at very high concurrency but is not typically decisive for 30B workloads. The RTX PRO 6000 wins on VRAM capacity (96GB vs 80GB), cost-per-token for 30B models, Blackwell FP4 support, and price ($1.65/hr on-demand or $0.72/hr spot vs $2.01/hr on-demand). For a detailed head-to-head comparison of H100 and H200, see our H100 vs H200 guide.

How to Deploy on Spheron

Deploying the RTX PRO 6000 on Spheron takes a few minutes:

  1. Go to app.spheron.ai and sign up with GitHub or Gmail
  2. Add credits using card or crypto, then click Deploy in the left-hand menu
  3. Select RTX PRO 6000 from the GPU catalog
  4. Choose your Ubuntu version, storage, and region, then click Deploy Instance

For 32B class models in FP8 (example: Qwen2.5-32B-Instruct), deploy with vLLM:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

For 30B AWQ inference:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

For 70B Q4 quantization at moderate concurrency:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model casperhansen/llama-3.3-70b-instruct-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
```

With 96GB of VRAM, 70B Q4 weights (~35-40GB) leave ample headroom for KV cache. The 0.85 GPU memory utilization setting caps vLLM at roughly 81.6GB of the 96GB, keeping about 14GB free as a safety buffer while still leaving 40GB+ beyond the weight footprint for KV cache.
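
Once any of these containers is up, vLLM exposes an OpenAI-compatible API on port 8000. A quick smoke test; the model field must match the --model value you deployed:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```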

Monitor VRAM with:

```bash
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
```

Pricing on Spheron

Pricing as of March 10, 2026 from live Spheron GPU marketplace data. Check www.spheron.network/pricing for current rates.

RTX PRO 6000 pricing:

| Configuration | $/hr | Notes |
|---|---|---|
| RTX PRO 6000 on-demand (dedicated) | $1.65 | Always available, fixed rate |
| RTX PRO 6000 spot | $0.72 | May be interrupted |

On-demand instances are always available at a fixed hourly rate. Spot instances run at a deep discount but can be interrupted when provider capacity is reclaimed, which makes them suitable for batch jobs, fine-tuning, and other non-time-sensitive workloads; a minimal interruption-handling pattern is sketched below.
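
A minimal pattern for interruption-tolerant spot jobs, assuming your workload checkpoints periodically and can resume from its latest checkpoint. Whether the instance receives a SIGTERM grace period before reclamation is an assumption here (check Spheron's docs), and train.py with --resume-from is a hypothetical placeholder for your own job:

```bash
#!/usr/bin/env bash
# Resumable spot-job wrapper: forward SIGTERM to the job so it can
# checkpoint and exit cleanly, then resume on the next run.
CKPT_DIR=/workspace/checkpoints   # hypothetical path; point at your checkpoint dir

run_job() {
  python train.py --resume-from "$CKPT_DIR" &   # placeholder for your checkpointing job
  local job_pid=$!
  trap 'kill -TERM "$job_pid"' TERM             # pass the interruption through
  wait "$job_pid"
}

until run_job; do
  echo "Job exited non-zero; resuming from latest checkpoint in 30s..."
  sleep 30
done
```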

For context, here is how the RTX PRO 6000 compares to alternatives across both pricing tiers:

| GPU | On-demand $/hr | Spot $/hr | VRAM | Fits 70B FP8? | Fits 70B Q4? |
|---|---|---|---|---|---|
| RTX 5090 PCIe | $0.76 | — | 32GB | No (70GB exceeds 32GB) | No (35-40GB exceeds 32GB) |
| RTX PRO 6000 | $1.65 | $0.72 | 96GB | Yes (~26GB KV headroom) | Yes (comfortable, ~56-61GB KV headroom) |
| H100 PCIe | $2.01 | — | 80GB | Yes (~10GB headroom) | Yes (comfortable) |
| H200 SXM5 | $4.23 | $1.43 | 141GB | Yes (extensive headroom) | Yes (extensive headroom) |
| 2x H100 PCIe | $4.03 | — | 2x 80GB | Yes (with tensor parallelism) | Yes (with tensor parallelism) |

The RTX PRO 6000 occupies a clear position: more VRAM than both the RTX 5090 (96GB vs 32GB) and the H100 PCIe (96GB vs 80GB), Blackwell FP4 support, ECC memory, and lower price than the H100 PCIe ($1.65/hr vs $2.01/hr on-demand, $0.72/hr spot). The H100 PCIe's advantage is higher memory bandwidth (2.0 TB/s vs 1.792 TB/s) for throughput-bound workloads at very high concurrency, though the ~11.6% difference may not be decisive for 30B workloads. For 30B through 70B Q4 inference with maximum single-GPU VRAM at the lowest cost, the PRO 6000 leads.


RTX PRO 6000 instances on Spheron give you 96GB of Blackwell GDDR7 on a single card, with FP4 support for next-generation inference efficiency. Available now, no waitlist.

Rent an RTX PRO 6000
