Spheron GPU Catalog

Rent NVIDIA B200 GPUs on Demand from $1.71/hr

192GB HBM3e Blackwell, built for trillion-parameter training and 100B+ LLM inference.

At a glance

You can rent an NVIDIA B200 on Spheron starting at $1.71 per GPU per hour on dedicated instances (99.99% SLA, non-interruptible), with spot pricing cheaper still. Billing is per minute with no contracts, and 8-GPU HGX B200 nodes connect their GPUs over NVLink 5.0 at 1.8 TB/s GPU-to-GPU bandwidth. Each B200 ships with 192GB HBM3e, 8 TB/s memory bandwidth, and a 2nd-gen Transformer Engine with native FP4 support, delivering roughly 2x faster LLM training than H100 and up to 15x faster inference when running FP4 (per MLPerf). It is designed for frontier-scale workloads: 1T+ parameter training, 100B+ parameter inference serving, and multi-modal foundation models where HBM capacity and NVLink bandwidth are the bottleneck.

GPU Architecture: NVIDIA Blackwell
VRAM: 192 GB HBM3e
Memory Bandwidth: 8.0 TB/s

Technical specifications

GPU Architecture: NVIDIA Blackwell
VRAM: 192 GB HBM3e
Memory Bandwidth: 8.0 TB/s
Tensor Cores: 5th Generation
CUDA Cores: 20,480
FP64 Performance: 40 TFLOPS
FP32 Performance: 80 TFLOPS
TF32 Performance: 1,125 TFLOPS (dense)
FP8 Performance: 4,500 TFLOPS (dense)
FP4 Performance: 9,000 TFLOPS (dense)
System RAM: 184 GB DDR5
vCPUs: 32 vCPUs
Storage: 250 GB NVMe Gen5
Network: NVLink, 1.8 TB/s
TDP: 1,000 W

Pricing comparison

Provider               Price/hr        Compared with Spheron
Spheron (your price)   $1.71/hr        -
Lambda Labs            $6.08/hr        3.6x more expensive
Nebius                 $5.50/hr        3.2x more expensive
CoreWeave (SXM)        $8.60/hr        5.0x more expensive
CoreWeave (NVL)        $10.50/hr       6.1x more expensive
AWS (p6-b200)          est. $12.00/hr  7.0x more expensive
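
The multiples in the comparison are each provider's hourly price divided by the $1.71/hr base rate. The sketch below reproduces that arithmetic and estimates a per-minute-billed job cost. The helper names are illustrative, and `prorated_cost` assumes usage is simply prorated by the minute, which may differ from the actual billing rules.

```python
# Hourly prices per GPU as listed in the comparison above.
SPHERON_USD_PER_GPU_HOUR = 1.71
OTHER_PROVIDERS = {
    "Lambda Labs": 6.08,
    "Nebius": 5.50,
    "CoreWeave (SXM)": 8.60,
    "CoreWeave (NVL)": 10.50,
    "AWS (p6-b200)": 12.00,  # listed as an estimate
}


def price_multiple(other_usd_per_hour, base_usd_per_hour=SPHERON_USD_PER_GPU_HOUR):
    """Return how many times more expensive `other` is than the base rate."""
    return round(other_usd_per_hour / base_usd_per_hour, 1)


def prorated_cost(minutes, gpus, usd_per_gpu_hour=SPHERON_USD_PER_GPU_HOUR):
    """Estimate job cost, assuming usage is prorated by the minute (an assumption)."""
    return minutes * gpus * usd_per_gpu_hour / 60


for name, price in OTHER_PROVIDERS.items():
    print(f"{name}: {price_multiple(price)}x the base rate")

# A 90-minute run on a full 8-GPU node at the base rate.
print(f"8 GPUs for 90 minutes: ${prorated_cost(90, 8):.2f}")
```
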
Custom & Reserved

Need More B200 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more B200 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the B200

Scenario 01

Pick B200 if

You're training frontier models (1T+ parameters), serving 100B+ parameter LLMs in production, or running MoE architectures that need the extra HBM capacity and NVLink bandwidth. FP4 support cuts inference cost per token roughly in half vs H100 FP8. If your model already maxes out 80GB on H100, B200 is the direct step up.

Recommended fit
Scenario 02

Pick H100 instead if

Your model fits in 80GB and you want the best price per hour for 70B-class training or inference. H100 is mature, has broad framework support, and costs significantly less per GPU-hour. B200 is overkill for anything under ~100B parameters.

Scenario 03

Pick H200 instead if

You need 141GB HBM3e to fit larger contexts or KV cache without the full Blackwell price bump. H200 is a drop-in upgrade from H100 and a popular middle ground for serving 70-180B parameter models.

Scenario 04

Pick B300 or GB200 instead if

You want Blackwell Ultra (B300) with 288GB HBM3e per GPU, or the GB200 Grace-Blackwell Superchip pairing two B200s with a Grace CPU over a 900 GB/s NVLink-C2C link. Both target the largest possible training runs and enterprise-scale reasoning models.


Ideal use cases

Use case / 01
🌐

Trillion-Parameter Model Training

Train the next generation of foundation models at exceptional scale, leveraging 192GB memory and 2nd-gen Transformer Engine.

- GPT-4 scale models with 1T+ parameters
- Multi-modal foundation models (text, image, video, audio)
- Scientific foundation models for drug discovery
- Mixture-of-Experts (MoE) architectures at scale

Use case / 02
💬

Advanced LLM Inference

Deploy ultra-large language models for production inference with industry-leading throughput and lowest cost per token.

- Real-time inference for 100B+ parameter LLMs
- Multi-turn conversational AI with long context
- Retrieval-augmented generation (RAG) at scale
- Agent-based AI systems with reasoning capabilities

Use case / 03

Generative AI at Scale

Power next-generation generative AI applications with support for advanced diffusion models and multi-modal generation.

- High-resolution video generation (4K/8K)
- Real-time 3D asset generation and rendering
- Music and audio synthesis models
- Code generation for enterprise applications

Use case / 04
🔬

AI Research & Innovation

Push the boundaries of AI research with cutting-edge hardware designed for experimental architectures and novel approaches.

- Novel neural architecture development
- Multi-agent reinforcement learning at scale
- Quantum machine learning simulations
- Brain-scale neural network simulation

Performance benchmarks

Llama 2 70B Inference: ~12,300 tok/s (FP4, server mode; MLPerf Inference v5.0)
GPT-3 175B Training: ~2x faster vs H100 SXM5 (MLPerf Training)
Llama 3.1 405B Training: ~2.2x per-GPU vs H100 SXM5 (NVIDIA)
FP4 Throughput: ~2x vs FP8 (2nd-gen Transformer Engine)
Memory Capacity: 2.4x larger vs H100 80GB (192GB vs 80GB)
Memory Bandwidth: 2.4x faster vs H100 SXM5 (8.0 vs 3.35 TB/s)

Serve Llama 3.1 405B on 8x B200 with vLLM + FP4

An 8-GPU HGX B200 node has about 1.5 TB of aggregate HBM (8 x 192 GB), enough to serve Llama 3.1 405B in FP4 with a 32K+ context window. vLLM shards the model across the GPUs with tensor parallelism over NVLink for low-latency inference.

```bash
# SSH into your 8x B200 HGX node
ssh root@<instance-ip>

# Start an interactive NVIDIA PyTorch container (-it keeps the shell attached).
# Confirm in the NGC release notes that the tag you pull supports Blackwell and FP4.
docker run -it --gpus all --ipc=host --ulimit memlock=-1 \
  -p 8000:8000 -v "$HOME/.cache:/root/.cache" \
  nvcr.io/nvidia/pytorch:24.10-py3 bash

# Inside the container: quote the requirement so the shell does not treat ">" as a redirect
pip install "vllm>=0.6.3"

# Launch Llama 3.1 405B with FP4 quantization across 8 GPUs.
# Accepted --quantization values vary by vLLM release; check `vllm serve --help`.
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95

# Test the endpoint from a second shell (vllm serve stays in the foreground)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-405B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
```

On an 8x B200 node running FP4, expect 5-8x higher tokens/sec than an 8x H100 node, thanks to the 2nd-gen Transformer Engine and NVLink 5.0.
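
The memory claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a capacity guarantee. It assumes exactly 4 bits per weight (real FP4 formats add some overhead for scale factors), a 16-bit KV cache, and Llama 3.1 405B's attention shape of 126 layers, 8 KV heads, and head dimension 128 (verify these against the model's config.json). Activations and runtime overhead are ignored.

```python
GB = 1e9


def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store the model weights at a given precision."""
    return num_params * bits_per_param / 8


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token; the factor of 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Node capacity: 8 GPUs x 192 GB, with 95% usable (matches --gpu-memory-utilization 0.95).
total_hbm = 8 * 192 * GB
usable_hbm = total_hbm * 0.95

# Assumed model shape for Llama 3.1 405B (check config.json before relying on it).
weights = weight_bytes(405e9, bits_per_param=4)
kv_per_token = kv_cache_bytes_per_token(
    num_layers=126, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
kv_per_sequence = kv_per_token * 32768  # one full 32K-token context

# How many full-length sequences fit in the memory left after the weights.
concurrent_sequences = int((usable_hbm - weights) // kv_per_sequence)

print(f"Weights at FP4: {weights / GB:.1f} GB")
print(f"KV cache per 32K sequence: {kv_per_sequence / GB:.1f} GB")
print(f"Full-length sequences that fit: {concurrent_sequences}")
```

Under these assumptions the weights take a small fraction of the node's HBM, and most of the remainder is available for KV cache.
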

Interconnect fabric

NVLink Switch Configuration

B200 GPUs feature the latest NVLink switch technology providing 1.8 TB/s bidirectional bandwidth per GPU. This enables near-linear scaling for multi-GPU training of trillion-parameter models with minimal communication overhead.

01 NVLink 5.0 with 1.8 TB/s per GPU bandwidth
02 About 14x the bandwidth of a PCIe Gen5 x16 link (1.8 TB/s vs ~128 GB/s bidirectional)
03 Full NVSwitch connectivity for 8-GPU systems
04 Unified memory addressing across all GPUs
05 Direct GPU-to-GPU communication without CPU
06 Support for NVIDIA SHARP for in-network computing
07 Optimized for DeepSpeed ZeRO-3 and FSDP
08 Sub-100ns GPU-to-GPU latency

Scale

Need a custom multi-node cluster or reserved capacity?


Frequently asked questions

What makes B200 different from H100?

B200 is built on the Blackwell architecture and delivers up to 2.5x the AI performance of H100. Key differences include 192GB HBM3e memory (2.4x more than H100), 8 TB/s memory bandwidth (2.4x faster), 5th generation Tensor Cores with FP4 precision support, and a 2nd-gen Transformer Engine. B200 is specifically designed for trillion-parameter models and next-gen AI applications.

Is B200 available for immediate deployment?

B200 GPUs are currently in limited availability through an early access program. Spheron is working directly with major data center providers to secure allocation for our customers. Contact our team to discuss your requirements and timeline. Priority is given to large-scale training workloads and research institutions.

Book a call with our team

What is FP4 precision and why does it matter?

FP4 (4-bit floating point) is a precision format introduced with the Blackwell architecture. It delivers roughly 2x the throughput of FP8 and, for many inference workloads, keeps accuracy close to higher-precision baselines. Halving the bits per weight also reduces cost per token for LLM inference and lets larger models fit in memory. The 2nd-gen Transformer Engine handles mixed FP4/FP8/FP16 precision automatically.
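
To make the format concrete, here is a minimal sketch of block-scaled 4-bit quantization in plain Python. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, and it uses one shared scale per block. This is an illustration of the idea only, not the Transformer Engine's actual implementation, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign bit handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(values):
    """Round a block of floats to E2M1 with one shared scale; return (scale, codes)."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 1.0, [0.0] * len(values)
    # Choose the scale so the largest magnitude in the block maps onto 6.0.
    scale = peak / E2M1_MAGNITUDES[-1]
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Map E2M1 codes back to approximate real values."""
    return [c * scale for c in codes]


block = [0.02, -0.31, 0.12, 0.6, -0.05, 0.44]
scale, codes = quantize_block_fp4(block)
restored = dequantize_block(scale, codes)

for original, approx in zip(block, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

The coarse grid is why 4-bit formats depend on per-block scaling: small values next to a large outlier collapse toward zero unless the block is kept short.
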

Can I train trillion-parameter models on B200?

Yes! B200 is specifically designed for trillion-parameter scale. With 192GB per GPU and NVLink switch providing 1.8 TB/s bandwidth, you can efficiently train models up to 2T+ parameters using distributed training frameworks like DeepSpeed, Megatron-LM, or FSDP. An 8-GPU B200 system provides about 1.5 TB of aggregate GPU memory (8 x 192 GB).

What frameworks are optimized for B200?

The major frameworks (PyTorch, TensorFlow, JAX) run on B200 when built against a Blackwell-capable CUDA toolkit, which means CUDA 12.8 or later. NVIDIA provides Blackwell-optimized containers that bundle matching CUDA, cuDNN, and framework builds. Support includes new features like FP4 precision, the 2nd-gen Transformer Engine, and improved NCCL for multi-GPU scaling. Check each framework's release notes for the first version that supports Blackwell before deploying.

How does NVLink switch improve performance?

NVLink switch provides 1.8 TB/s bidirectional bandwidth per GPU (about 14x that of a PCIe Gen5 x16 link at ~128 GB/s bidirectional), enabling GPUs to communicate directly without CPU bottlenecks. This is crucial for distributed training, where gradient synchronization can be a major bottleneck. With 8 B200s connected via NVLink, you get near-linear scaling efficiency (90%+) even for the largest models.
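
A rough way to see why interconnect bandwidth matters for gradient synchronization is a bandwidth-only estimate of a ring all-reduce. The sketch below rests on several assumptions: a ring algorithm, in which each GPU sends about 2(N-1)/N times the payload; per-direction bandwidth equal to half the bidirectional figure (900 GB/s for NVLink 5.0, about 64 GB/s for PCIe Gen5 x16); and no latency, overlap, or protocol overhead. Treat the results as lower bounds, not measurements.

```python
GB = 1e9


def ring_allreduce_seconds(payload_bytes, num_gpus, per_direction_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce across `num_gpus` GPUs."""
    # Reduce-scatter plus all-gather: each GPU sends 2 * (N - 1) / N of the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / per_direction_bytes_per_s


# Example payload: BF16 gradients for a hypothetical 70B-parameter model (2 bytes each).
gradient_bytes = 70e9 * 2

nvlink_seconds = ring_allreduce_seconds(gradient_bytes, 8, 900 * GB)
pcie_seconds = ring_allreduce_seconds(gradient_bytes, 8, 64 * GB)

print(f"NVLink 5.0:    {nvlink_seconds:.2f} s per full gradient all-reduce")
print(f"PCIe Gen5 x16: {pcie_seconds:.2f} s per full gradient all-reduce")
```

In practice frameworks overlap communication with computation and shard gradients, so real step times depend on much more than this bound.
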

What's the cost comparison vs purchasing B200 hardware?

B200 GPUs cost $30,000-40,000 each when available for purchase, and an 8-GPU HGX B200 server lands in the $400K-500K range before infrastructure (power, cooling, networking, 400G InfiniBand). Factor in data center space, a power budget of roughly 1 kW per GPU (10 kW or more per 8-GPU server once CPUs, networking, and cooling overhead are counted), and 3-5 year depreciation. For most teams, on-demand rental at Spheron's rates is far more cost-effective unless you have sustained 24/7 utilization above ~70%. Rental also avoids the 6-12 month lead times currently on new Blackwell hardware.
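
The utilization threshold can be estimated with a simple break-even calculation. The sketch below compares purchase price per GPU against the cost of renting that GPU around the clock over the depreciation period. It deliberately ignores power, cooling, networking, and staffing, all of which push the real break-even point higher. The $450K server price is the midpoint of the range above, used here as an assumption.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_utilization(server_price_usd, gpus_per_server, years, rental_usd_per_gpu_hour):
    """Fraction of hours a purchased GPU must be busy to match rental cost.

    A result above 1.0 means renting is cheaper even at 100% utilization.
    Operating costs (power, cooling, staff) are not included.
    """
    purchase_per_gpu = server_price_usd / gpus_per_server
    rental_at_full_use = rental_usd_per_gpu_hour * HOURS_PER_YEAR * years
    return purchase_per_gpu / rental_at_full_use


five_year = break_even_utilization(450_000, 8, years=5, rental_usd_per_gpu_hour=1.71)
three_year = break_even_utilization(450_000, 8, years=3, rental_usd_per_gpu_hour=1.71)

print(f"5-year depreciation: break-even at {five_year:.0%} utilization")
print(f"3-year depreciation: break-even at {three_year:.0%} utilization")
```

With these inputs, the hardware-only break-even sits near three-quarters utilization over five years, and over three years it exceeds 100%, meaning purchase does not pay off on hardware cost alone.
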

Can I use B200 for inference only?

Absolutely! B200 provides exceptional inference performance with FP4 precision support, delivering up to 9,000 dense TFLOPS per GPU. It can serve very large models (100B+ parameters) with high throughput. However, for inference-only workloads under 70B parameters, you might find better cost-efficiency with H100 or A100 GPUs.

What kind of workloads benefit most from B200?

B200 excels at: trillion-parameter model training, very large LLM inference (100B+ params), multi-modal foundation models, mixture-of-experts architectures, high-resolution generative AI (video, 3D), and scientific computing requiring massive memory. If your model is under 100B parameters or fits comfortably in H100 memory, H100 or A100 may be more cost-effective.

Do you offer dedicated B200 clusters?

Yes! For enterprise customers and research institutions, we offer dedicated B200 clusters with custom configurations (8-512 GPUs), reserved capacity, and volume pricing. Dedicated clusters include priority support, custom networking, and flexible billing. Contact our enterprise team to discuss your requirements.


What's the difference between dedicated and spot B200 instances?

Dedicated B200 instances are non-interruptible, run on a 99.99% SLA, and bill per-minute at the on-demand rate. Spot instances run on spare capacity at meaningfully lower rates but can be preempted when dedicated demand rises. Use spot for fault-tolerant workloads: batch inference, hyperparameter sweeps, or any training loop with frequent checkpointing. For trillion-parameter training runs where a preemption costs days of progress, always use dedicated. Both tiers live in the same control plane, so you can mix them across a project (e.g., dedicated for the main training job, spot for evaluation jobs).
