Spheron GPU Catalog

Rent NVIDIA RTX PRO 6000 GPUs on Demand from $0.59/hr

96GB GDDR7 ECC Blackwell, built to run 70B FP8 LLMs on a single GPU.

At a glance

You can rent an NVIDIA RTX PRO 6000 Blackwell on Spheron starting at $0.59 per GPU per hour on dedicated instances (99.99% SLA, non-interruptible), with spot pricing cheaper still. Per-minute billing, no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each card ships with 96GB GDDR7 ECC, 1.79 TB/s memory bandwidth, 24,064 CUDA cores, and 5th generation Tensor Cores with native FP4 support, giving you the largest single-GPU VRAM available outside HBM datacenter SKUs. Perfect for teams that need to run 30B-70B LLMs at FP8 on a single GPU, fine-tune medium-sized models with LoRA, or handle professional rendering and visualization workloads without stepping up to H100 pricing.

GPU Architecture: NVIDIA Blackwell
VRAM: 96 GB GDDR7 ECC
Memory Bandwidth: 1.79 TB/s

Technical specifications

GPU Architecture
NVIDIA Blackwell
VRAM
96 GB GDDR7 ECC
Memory Bandwidth
1.79 TB/s
Tensor Cores
5th Gen (752 cores)
CUDA Cores
24,064
RT Cores
4th Gen (188 cores)
FP32 Performance
126 TFLOPS
FP16 Tensor (dense)
504 TFLOPS
FP8 Tensor (dense)
1,008 TFLOPS
FP4 Tensor (dense)
2,016 TFLOPS
Form Factor
Workstation (dual-slot PCIe)
Interconnect
PCIe Gen5 x16
NVLink
Not supported
TDP
600W (Max-Q: 300W)

Pricing comparison

Provider / Price/hr / Savings
Spheron (your price): $0.59/hr
Vast.ai: $1.00/hr (1.7x more expensive)
Hyperstack: $1.80/hr (3.1x more expensive)
RunPod: $1.69/hr (2.9x more expensive)
CoreWeave: $2.50/hr (4.2x more expensive)
Custom & Reserved

Need More RTX PRO 6000 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more RTX PRO 6000 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the RTX PRO 6000

Scenario 01

Pick RTX PRO 6000 Blackwell if

You want to run 30B-70B LLMs at FP8 on a single GPU without paying H100 rates. 96GB GDDR7 lets Llama 3.3 70B FP8, Qwen 2.5 32B FP16, and 70B AWQ models fit comfortably with KV cache headroom. Best single-GPU VRAM capacity below the H100/H200 price tier.

Recommended fit
Scenario 02

Pick RTX 5090 instead if

Your models fit in 32GB and you want the cheapest Blackwell hourly rate. RTX 5090 matches PRO 6000 on memory bandwidth (1.79 TB/s) and FP4 support, but lacks ECC and caps out at 32GB. Great for 7B-13B inference, SDXL, and Flux.

Recommended fit
Scenario 03

Pick L40S instead if

You need a datacenter-certified SKU with 48GB ECC and long-term multi-tenant support, and you don't need Blackwell FP4. L40S is purpose-built for inference serving and is widely available across hyperscalers.

Recommended fit
Scenario 04

Pick H100 or B200 instead if

You need HBM bandwidth (3.35-8 TB/s) and NVLink for multi-GPU tensor parallelism on 100B+ models. PCIe PRO 6000 has no NVLink, so scale-out is limited to data parallelism. For trillion-parameter training, go B200.

Recommended fit

Ideal use cases

Use case / 01
🎨

Professional Rendering

Leverage 4th generation RT Cores and Blackwell architecture for real-time ray tracing, CAD/CAM workflows, and digital content creation.

Real-time ray tracing for architectural visualization
CAD/CAM design and engineering workflows
Digital content creation and VFX pipelines
Product design and photorealistic rendering
Use case / 02
🧠

AI Development & Fine-Tuning

Perfect for fine-tuning 7B-32B models and running 70B FP8 on a single GPU with 96GB of GDDR7 ECC memory.

LoRA and QLoRA fine-tuning of 7B-32B models
Llama 3.3 70B FP8 and 70B AWQ inference
Qwen 2.5 32B FP16 fine-tuning with headroom for KV cache
Transfer learning and domain adaptation
Use case / 03

AI Inference

Cost-effective inference for 30B-70B models on a single GPU, with FP4 and FP8 Tensor Core acceleration.

Llama 3.3 70B FP8 and 70B AWQ on a single GPU
Real-time image generation and diffusion models
Production inference APIs with dynamic batching
Edge AI and embedded deployment testing
Use case / 04
🔬

Scientific Visualization

Accelerate medical imaging, molecular visualization, and engineering simulation with professional-grade GPU compute.

Medical imaging and DICOM visualization
Molecular dynamics and protein structure visualization
Engineering simulation and CFD post-processing
Geospatial data analysis and 3D mapping

Performance benchmarks

Llama 3.1 8B Inference
~8,990 tokens/s
vLLM, batched aggregate
Llama 3.1 70B Inference
~24,000 tokens/s
vLLM FP8, 100 concurrent requests (aggregate)
30B AWQ Throughput
~8,400 tokens/s
matches 4x RTX 4090 (CloudRift)
SDXL 1024x1024
~11 img/min
FP16, base + refiner
Memory Bandwidth
1.79 TB/s
GDDR7, 512-bit bus
vs RTX 6000 Ada
~2x faster
Blackwell FP4 + 2x VRAM

Serve Llama 3.3 70B FP8 on a single RTX PRO 6000

96GB GDDR7 is enough to load Llama 3.3 70B at FP8 (~70GB weights) with room for KV cache at moderate batch sizes. vLLM gives you an OpenAI-compatible endpoint in one command.

bash
# SSH into your RTX PRO 6000 instance
ssh root@<instance-ip>

# Install vLLM with CUDA 12.4+ (Blackwell FP8 kernels)
pip install "vllm>=0.6.3"

# Launch Llama 3.3 70B at FP8
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.3-70B-Instruct","messages":[{"role":"user","content":"Hello"}]}'

For 30B-class models (Qwen 2.5 32B, Mixtral 8x7B), FP16 fits comfortably and lets you serve higher concurrency.
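To see where the "KV cache headroom" comes from, here is a back-of-envelope sizing, assuming the published Llama 3.3 70B shape (80 layers, 8 KV heads via GQA, head dimension 128) and an FP8 KV cache; treat the numbers as rough estimates, not vLLM's exact accounting:

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
awk 'BEGIN {
  layers = 80; kv_heads = 8; head_dim = 128; bytes = 1   # FP8 KV cache
  per_token = 2 * layers * kv_heads * head_dim * bytes

  weights_gb = 70                 # ~70GB of FP8 weights
  usable_gb  = 96 * 0.92          # --gpu-memory-utilization 0.92
  headroom_gb = usable_gb - weights_gb

  printf "KV cache per token    : %d KB\n", per_token / 1024
  printf "Headroom after weights: ~%.0f GB\n", headroom_gb
  printf "Cacheable tokens      : ~%d\n", headroom_gb * 1024^3 / per_token
}'
```

Roughly 18GB of headroom at 160KB per token works out to on the order of 100K+ tokens of KV cache, which is why moderate batch sizes at an 8K context fit comfortably.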


Frequently asked questions

How does RTX PRO 6000 compare to RTX A6000?

The RTX PRO 6000 Blackwell delivers roughly 2x the AI throughput of the RTX A6000 / RTX 6000 Ada. Key improvements: 96GB GDDR7 ECC (vs 48GB GDDR6 on Ada), 5th generation Tensor Cores with native FP4 and FP8 support, 4th generation RT Cores, 24,064 CUDA cores (vs 18,176), and 1.79 TB/s memory bandwidth (vs 960 GB/s). FP4 support is the bigger unlock for LLM inference, doubling throughput vs FP8 on compatible workloads.

Is RTX PRO 6000 suitable for AI training?

Yes. The RTX PRO 6000 Blackwell is a strong fit for fine-tuning up to 32B parameter models and LoRA/QLoRA on 70B models. 96GB GDDR7 ECC with 1.79 TB/s bandwidth handles most production fine-tuning scenarios on a single GPU. For full pre-training runs or tensor-parallel training of 70B+ models, use H100/H200/B200 with HBM memory and NVLink, since PRO 6000 is a PCIe workstation card without NVLink.

What makes RTX PRO 6000 a 'PRO' GPU?

The 'PRO' designation indicates enterprise-grade features: professional vGPU drivers for virtualization support, ECC memory for data integrity, ISV certifications for industry-standard applications (Autodesk, Dassault, Siemens), and professional visualization features including enhanced ray tracing and viewport rendering. These features ensure reliability and compatibility for mission-critical professional workflows.

Can I run LLMs on RTX PRO 6000?

Yes, and this is where the PRO 6000 Blackwell is strongest. 96GB GDDR7 ECC fits Llama 3.3 70B at FP8 (~70GB), 70B AWQ (~40GB), Qwen 2.5 32B at FP16 (~64GB), and 30B-class models at FP16 with ample KV cache headroom. Only Llama 70B FP16 (~140GB) exceeds the capacity, and for that you need H200 (141GB) or B200 (192GB). For most production inference, the PRO 6000 lets you serve modern LLMs on a single GPU at a lower hourly rate than H100.
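The weight footprints quoted above follow directly from parameter count times bytes per parameter (quantized formats like AWQ add some overhead on top, which is why ~35GB of raw 4-bit weights lands closer to ~40GB in practice):

```shell
# Approximate weight footprint for a 70B-parameter model: params * bytes per parameter
awk 'BEGIN {
  params = 70e9
  printf "FP16 : ~%.0f GB\n", params * 2   / 1e9   # exceeds 96GB; needs H200/B200
  printf "FP8  : ~%.0f GB\n", params * 1   / 1e9   # fits with KV cache headroom
  printf "4-bit: ~%.0f GB\n", params * 0.5 / 1e9   # AWQ, before quantization overhead
}'
```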

What rendering software is supported?

The RTX PRO 6000 is certified and optimized for all major rendering and design applications: Blender, Autodesk Maya, Autodesk 3ds Max, Cinema 4D, V-Ray, KeyShot, and NVIDIA Omniverse. ISV certifications ensure full compatibility and optimized performance with professional workflows.

How does RTX PRO 6000 compare to H100 for AI?

PRO 6000 Blackwell has more VRAM (96GB GDDR7 ECC vs 80GB HBM3 on H100 SXM), but lower memory bandwidth (1.79 TB/s vs 3.35 TB/s) and no NVLink. H100 wins on raw bandwidth for training and tensor parallelism. PRO 6000 wins on hourly cost and capacity for single-GPU inference of 30B-70B models, plus it adds Blackwell FP4 support that H100 lacks. For models that fit in 96GB and aren't bandwidth-bound, PRO 6000 is the cheaper pick.

What's the minimum rental period?

There is no minimum rental period. Spheron offers per-minute billing for RTX PRO 6000 instances, so you only pay for the exact compute time you use. Start and stop instances at any time with no long-term commitment required.
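Per-minute billing means short jobs are priced exactly by their duration. For example, a hypothetical 47-minute run at the $0.59/hr dedicated rate:

```shell
# Cost of a 47-minute job billed per minute at $0.59/hr
awk 'BEGIN { printf "$%.2f\n", 0.59 / 60 * 47 }'
```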

Can I use RTX PRO 6000 for video editing and encoding?

Yes. The RTX PRO 6000 features four 9th generation NVENC encoders with AV1 and 4:2:2 H.264/HEVC hardware encoding support, plus 6th generation NVDEC decoders. That combination makes it a strong fit for professional video production pipelines, real-time editing, and high-throughput media transcoding workflows.

What regions are available for RTX PRO 6000?

RTX PRO 6000 instances are available in US, Europe, and Canada regions. Availability may vary by region based on current demand. Check the Spheron app at app.spheron.ai for real-time availability and region selection.

Do you offer technical support for RTX PRO 6000?

Yes! Our team provides technical support to help you optimize your GPU workloads. We offer assistance with deployment, performance tuning, and troubleshooting. Enterprise customers get dedicated support channels and architecture review sessions.

Book a call with our team

What's the difference between dedicated and spot RTX PRO 6000 instances?

Dedicated RTX PRO 6000 instances are non-interruptible, run on a 99.99% SLA, and bill per-minute at the on-demand rate. Spot instances run on spare capacity at meaningfully lower rates but can be preempted when dedicated demand rises. Use spot for fault-tolerant workloads: batch inference, QLoRA fine-tuning with checkpointing every 15-30 minutes, or hyperparameter sweeps. Use dedicated for customer-facing inference endpoints, rendering pipelines with hard deadlines, or any job where an interruption would cause data loss. Both tiers live in the same control plane, so you can mix them across a project.
