
NVIDIA B200 Guide: Specs, Benchmarks, Cloud Pricing & H100 Upgrade

Written by Mitrasish, Co-founder · Mar 19, 2026
The B200 is the Blackwell GPU you can rent today: $2.25/hr spot (or $6.03/hr on-demand) on Spheron, roughly 2x the training throughput of an H100, 192 GB of HBM3e, and native FP4 support. If you're running 70B+ models or paying for H100 inference at scale, the upgrade math is straightforward. Start with B200 GPU rental on Spheron and read on for the full picture.

B200 Specs: The Full Blackwell Picture

Everything that matters about the B200 comes down to three headline figures: 192 GB memory, 8.0 TB/s bandwidth, and 9,000 TFLOPS FP4 (dense). Here's the complete spec table pulled directly from NVIDIA's B200 data sheet:

| Spec | Value |
|---|---|
| Architecture | NVIDIA Blackwell |
| VRAM | 192 GB HBM3e |
| Memory Bandwidth | 8.0 TB/s |
| Tensor Cores | 5th Generation |
| CUDA Cores | 16,896 |
| FP64 Performance | 37 TFLOPS |
| FP32 Performance | 75 TFLOPS |
| BF16/FP16 Performance | 2,250 TFLOPS (dense) |
| FP8 Performance | 4,500 TFLOPS (dense) |
| FP4 Performance | 9,000 TFLOPS (dense) |
| TDP | 1,000W |
| NVLink Bandwidth | 1.8 TB/s bidirectional |

What these numbers mean in practice:

192 GB VRAM puts a full 70B parameter model in FP16 on a single GPU. H100 maxes out at 80 GB, meaning 70B at FP16 requires tensor-parallel sharding across two GPUs. The B200 eliminates that complexity entirely. For 100B+ models, the B200 gets you further before you need multi-GPU coordination.

8.0 TB/s memory bandwidth is roughly 2.4x the H100's 3.35 TB/s. For large-batch inference, memory bandwidth is often the real bottleneck, not raw TFLOPS. At 8 TB/s, the B200 sustains high throughput even at batch sizes where the H100 starts to stall.

9,000 TFLOPS FP4 (dense) is a capability H100 simply does not have. The 5th-generation Tensor Cores introduce native FP4 support, doubling effective throughput compared to FP8 for models that can tolerate the precision reduction. See the FP4 quantization guide for a deep dive on quality tradeoffs.

1,000W TDP requires high-density power delivery. Not a drop-in swap into air-cooled infrastructure, but lower than the B300's 1,400W, and manageable in most modern datacenters with proper PDU planning.
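The VRAM claims above are easy to check with a back-of-envelope rule: weight memory ≈ parameter count × bytes per parameter. A minimal sketch, ignoring KV cache, activations, and framework overhead:

```python
def model_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: parameters (billions) x bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 1e9

# A 70B-parameter model at common precisions
for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"70B @ {name}: ~{model_vram_gb(70, bpp):.0f} GB")
# FP16 -> ~140 GB: needs two 80 GB H100s, fits one 192 GB B200
# FP8  -> ~70 GB:  barely fits an H100, little room for KV cache
# FP4  -> ~35 GB
```

This is a rough lower bound; real deployments also budget tens of GB for KV cache and runtime overhead, which is where the B200's extra headroom matters.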

B200 vs H100: When the Upgrade Is Worth It

Here's how the B200 stacks up against the H100 SXM5 across the specs that matter most for LLM workloads. For a deeper look at H100 variants and H200 comparisons, see our H100 vs H200 benchmark guide.

| Spec | B200 | H100 SXM |
|---|---|---|
| Architecture | Blackwell | Hopper |
| VRAM | 192 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 8.0 TB/s | 3.35 TB/s |
| FP4 Dense (TFLOPS) | 9,000 | N/A |
| FP8 Dense (TFLOPS) | 4,500 | ~1,979 |
| FP16 Dense (TFLOPS) | 2,250 | ~989 |
| TDP | 1,000W | 700W |
| NVLink Bandwidth | 1.8 TB/s | 900 GB/s |

All values are dense (non-sparse). H100 also publishes 2:4 structured-sparsity figures (~3,958 TFLOPS FP8 sparse) which require sparse weight patterns to activate.

Use B200 When:

Your models need more than 80 GB. Llama 3.1 70B at FP16 is ~140 GB. At FP8 it's ~70 GB, which barely fits an H100 with no room for KV cache at scale. B200's 192 GB handles this comfortably. For 100B+ models, B200 is often the only practical single-GPU option.

You're running high-traffic inference APIs. The 4,500 TFLOPS FP8 (dense) is roughly 2.3x the H100's 1,979 TFLOPS. More throughput per GPU means fewer GPUs to serve the same request volume.

FP4 inference is viable for your workload. At 9,000 TFLOPS FP4 (dense), the B200 doubles effective throughput compared to FP8. If your model tolerates FP4 quality (most production inference does), cost per token drops sharply.

Long context windows are required. At 128K context length on a 70B model, KV cache alone consumes 30-50 GB. An H100 with 80 GB total doesn't leave enough room for model weights plus KV cache at this scale. B200's 192 GB handles it.
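The 30-50 GB KV cache figure above can be derived from the model architecture: two tensors (K and V) per layer, per token. A sketch using the Llama 3.1 70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128 — assumed here for illustration):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total / 2**30

# Single 128K-token sequence, FP16 cache
print(kv_cache_gib(80, 8, 128, 128 * 1024))  # 40.0 GiB
```

At 40 GiB for a single 128K sequence, an 80 GB H100 holding even an FP8-quantized 70B model (~70 GB) has no room left, while a 192 GB B200 does.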

Stick With H100 When:

Your models fit in 80 GB. For 7B to 34B models, the H100 provides strong throughput at $1.49-2.99/hr. No reason to pay more for VRAM you won't use.

Training is the primary workload. For distributed training, the H100 is still the best cost-per-FLOP option. The B200's FP4 advantage is inference-specific; training workloads don't see the same proportional gains.

Budget is the constraint. Eight H100s at $2/hr totals $16/hr. Eight B200s at $6.03/hr on-demand totals $48.24/hr on Spheron, roughly 3x the cost of H100. For cost-sensitive training or inference workloads where 80 GB VRAM is sufficient, H100 remains more economical at on-demand rates.

The Cost-Per-Token Math

Cost per hour is the wrong metric. What matters is cost per output token, which combines throughput and price. Based on benchmarks (see the Real-World Benchmarks section below):

| GPU | Est. Throughput (Llama 2 70B) | Price/hr | Relative Cost-per-Token |
|---|---|---|---|
| H100 SXM | ~3,000 tok/s | $2.00 | 1.0x (baseline) |
| B200 (FP8, est.) | ~6,000 tok/s | $6.03 | ~1.51x (51% more expensive) |
| B200 (FP4, measured) | ~12,305 tok/s | $6.03 | ~0.74x (26% cheaper) |

At on-demand pricing ($6.03/hr for B200 vs $2.00/hr for H100), FP4 inference is about 26% cheaper per token. FP8 at on-demand rates is roughly 51% more expensive per token than H100. The B200's on-demand cost advantage only applies when using FP4. At spot pricing ($2.25/hr), the per-token advantage grows substantially. The FP4 throughput figure is measured from MLPerf Inference v5.0 server mode results (Llama 2 70B). The FP8 figure is estimated based on TFLOPS and memory bandwidth ratios relative to H100. Treat estimates as directional and run your own benchmarks before committing.
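The arithmetic behind the table reduces to one formula: hourly price divided by tokens per hour. A sketch using the throughput and price figures above:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per 1M output tokens at a given sustained throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

h100     = cost_per_million_tokens(2.00, 3000)    # ~$0.185 / 1M tokens
b200_fp8 = cost_per_million_tokens(6.03, 6000)    # ~$0.279 / 1M tokens
b200_fp4 = cost_per_million_tokens(6.03, 12305)   # ~$0.136 / 1M tokens
b200_fp4_spot = cost_per_million_tokens(2.25, 12305)  # ~$0.051 / 1M tokens

print(f"FP8 vs H100: {b200_fp8 / h100:.2f}x")  # ~1.51x
print(f"FP4 vs H100: {b200_fp4 / h100:.2f}x")  # ~0.74x
```

Plug in your own measured throughput and the rate you actually pay; the ratios shift quickly with spot pricing.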

B200 vs B300: What the Ultra Adds

The B300 (Blackwell Ultra) is the high-binned version of B200. Same base architecture, pushed further. For a full analysis of the B300 specifically, see our NVIDIA B300 Blackwell Ultra guide.

| Spec | B300 | B200 | H200 | H100 |
|---|---|---|---|---|
| Architecture | Blackwell Ultra | Blackwell | Hopper | Hopper |
| VRAM | 288 GB HBM3e | 192 GB HBM3e | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 8 TB/s | 8 TB/s | 4.8 TB/s | 3.35 TB/s |
| FP4 Dense (TFLOPS) | 14,000 | 9,000 | N/A | N/A |
| FP8 Dense (TFLOPS) | 7,000 | 4,500 | ~1,979 | ~1,979 |
| FP16 Dense (TFLOPS) | 3,500 | 2,250 | ~989 | ~989 |
| TDP | 1,400W | 1,000W | 700W | 700W |
| Interconnect | NVLink 5 (1.8 TB/s) | NVLink 5 (1.8 TB/s) | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) |
| Networking | ConnectX-8 (1.6T) | ConnectX-7 (800G) | ConnectX-7 (800G) | ConnectX-7 (800G) |

All values are dense (non-sparse).

The B300 adds 50% more VRAM (288 GB vs 192 GB), 55.6% more FP4 compute (14,000 vs 9,000 TFLOPS), 40% higher TDP (1,400W vs 1,000W), and doubles inter-node networking via ConnectX-8. The networking upgrade matters specifically for multi-node training where inter-GPU communication is the bottleneck.

B200 is the right call for most teams today. Supply is more readily available, spot pricing starts at $2.25/hr on Spheron vs B300 on-demand at $3.50/hr on Spheron (B300 spot pricing is also available), and the software stack is more mature. B300 is worth considering when a single GPU must hold a 70B+ model without quantization, or when FP4 throughput is the primary constraint and the extra 5,000 TFLOPS dense FP4 makes a measurable difference for your workload.

FP4 and FP8: The Blackwell Precision Advantage

The B200's 5th-generation Tensor Cores introduce two precision capabilities that H100 and H200 cannot match: native hardware-accelerated FP4 inference, and significantly higher FP8 throughput.

Here's what changes with Blackwell:

FP4 (native): The B200 delivers 9,000 TFLOPS FP4 dense. H100 has no native FP4 hardware support. For inference, FP4 doubles effective throughput compared to FP8 on the same GPU. With 4-bit weights, four times as many parameters fit per unit of memory bandwidth compared to FP16, which directly translates to higher batch throughput for memory-bandwidth-bound workloads.

FP8 at scale: B200's 4,500 TFLOPS FP8 dense is roughly 2.3x the H100's 1,979 TFLOPS. This improvement applies to any FP8 workload without requiring FP4 quantization.
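A rough way to see why precision and bandwidth dominate decode throughput: each generated token streams the full weight set from memory once, so single-stream decode is bounded by bandwidth ÷ model size. A back-of-envelope sketch, ignoring KV cache reads, batching, and kernel overheads:

```python
def decode_ceiling_tok_s(bandwidth_tbs: float, model_gb: float) -> float:
    """Upper bound on single-stream decode: every token reads all weights once."""
    return bandwidth_tbs * 1e12 / (model_gb * 1e9)

# 70B model on a B200 (8 TB/s) at different precisions
for name, gb in [("FP16", 140), ("FP8", 70), ("FP4", 35)]:
    print(f"{name}: ~{decode_ceiling_tok_s(8.0, gb):.0f} tok/s ceiling")
# FP16 -> ~57, FP8 -> ~114, FP4 -> ~229: halving precision doubles the ceiling
```

Real served throughput is far higher with batching, but the same proportionality holds: FP4 moves twice as many parameters per byte of bandwidth as FP8.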

Framework support for FP4:

  • TensorRT-LLM 0.17+ with native FP4 kernels
  • vLLM with VLLM_USE_FLASHINFER_MOE_FP4=1 environment variable
  • SGLang with FP4 quantization support
  • NVIDIA's published FP4 quantization toolkit for calibrating existing models

FP4 quality tradeoffs are model and task dependent. Most production inference tasks (classification, summarization, standard chat completion) show minimal degradation. Tasks requiring precise numerical outputs or fine-grained semantic distinctions may see measurable quality loss at FP4. Always validate quality on your specific use case before moving production traffic to FP4. See the FP4 quantization guide for benchmark data and a decision framework.

Real-World Benchmarks

These figures are sourced from MLPerf Training v4.1 and MLPerf Inference v5.0 results, plus NVIDIA published estimates where indicated:

| Benchmark | B200 Result | Comparison |
|---|---|---|
| GPT-3 (175B) Training | ~2x faster | vs H100 SXM5 (MLPerf Training v4.1) |
| LLM Inference Throughput | ~12,305 tokens/s | Llama 2 70B FP4, server mode (MLPerf Inference v5.0) |
| Mixture-of-Experts Inference | ~2.1x faster | vs H200 (MLPerf Inference v5.0, Mixtral 8x7B) |
| Multi-Modal Model Training | ~2.8x faster | vs H100 SXM5 (NVIDIA published estimate) |
| Stable Diffusion XL | ~1.6x faster | 1024x1024 inference vs H200 (server mode, MLPerf Inference v5.0) |
| Memory Capacity | 2.4x larger | vs H100 80GB |

The inference throughput figure (~12,305 tok/s for Llama 2 70B FP4 in server mode, per MLPerf Inference v5.0) is the key number for inference cost calculations. Using H100 as a ~3,000 tok/s baseline, the B200 delivers roughly 4x throughput at a small hourly premium on Spheron, which translates directly to lower cost per token.

For VRAM-constrained workloads, the 2.4x memory advantage (192 GB vs 80 GB) often enables workloads that would require multi-GPU H100 configurations to run on a single B200, reducing system complexity and inter-GPU communication overhead.

For more detailed analysis across GPU generations, see the GPU selection guide for LLMs.

Cloud Pricing Across Providers

B200 pricing as of March 18, 2026. Rates fluctuate with GPU availability.

| Provider | Price Per GPU/hr | Notes |
|---|---|---|
| Spheron (spot) | $2.25/hr | Lowest spot price available |
| Spheron (on-demand) | $6.03/hr | On-demand, no commitment |
| RunPod | $4.99/hr | Secure Cloud |
| Nebius | $5.50/hr | On-demand |
| Lambda Labs | from $4.99/hr | On-demand |
| Azure | est. $14.25/hr | On-demand |
| AWS (p6 instance) | est. $14.25/hr | On-demand |
| Google Cloud | est. $18.50/hr | On-demand |

Check current GPU pricing for live rates.

For interruptible workloads like batch inference, experimentation, and development, Spheron's spot pricing at $2.25/hr is the lowest available B200 rate. For production inference and long-running training where interruption is not acceptable, on-demand pricing applies. Spheron's on-demand at $6.03/hr is competitive with other providers, and spot at $2.25/hr vs $14.25/hr on AWS is still a 6x difference for the same GPU. At scale that gap compounds quickly: 100 on-demand GPUs running continuously for a month is $434,160 on Spheron vs over $1,026,000 on AWS. The hardware is identical. The difference is infrastructure overhead, margin structure, and the provider's cost to procure and operate B200 capacity.
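The fleet-cost comparison above is simple to reproduce: GPU count × hourly rate × hours in a month (720). A sketch using the rates quoted in this section:

```python
def monthly_cost(n_gpus: int, price_per_hr: float, hours: int = 720) -> float:
    """Monthly fleet cost for GPUs running continuously (30-day month)."""
    return n_gpus * price_per_hr * hours

print(f"Spheron on-demand: ${monthly_cost(100, 6.03):,.0f}")   # $434,160
print(f"AWS est. on-demand: ${monthly_cost(100, 14.25):,.0f}")  # $1,026,000
print(f"Spheron spot:      ${monthly_cost(100, 2.25):,.0f}")   # $162,000 (interruptible)
```

The spot figure assumes workloads that tolerate interruption; production inference should be costed at on-demand rates.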

B200 pricing will compress over the next 6-12 months as supply ramps. The H100 followed the same pattern: $8/hr at launch down to under $3/hr by 2026. Spot pricing captures the B200's performance advantage before on-demand rates normalize.

Availability and Infrastructure Requirements

Availability: B200 is more widely available than B300 as of Q1 2026. Spheron has B200 inventory available as both Spot and Dedicated instances. Spot pricing gives access at reduced rates for interruptible workloads (batch inference, experimentation, non-time-sensitive jobs). Dedicated instances guarantee availability for production inference and long-running training.

Power: At 1,000W per GPU, an 8-GPU B200 system draws 7-8 kW from GPUs alone. This is substantial but lower than B300's 1,400W per GPU (11.2 kW for 8 GPUs). Most modern high-density datacenters can handle B200 TDP with appropriate PDU configuration. Unlike B300, liquid cooling is not strictly mandatory, though it's recommended for sustained peak load operation.
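The power figures translate directly into rack-planning numbers. A sketch, where the non-GPU overhead fraction (CPUs, NICs, fans) is an illustrative assumption, not a measured value:

```python
def node_power_kw(n_gpus: int, gpu_tdp_w: int, overhead_frac: float = 0.35):
    """GPU draw in kW, plus an assumed node overhead fraction for CPUs/NICs/cooling."""
    gpu_kw = n_gpus * gpu_tdp_w / 1000
    return gpu_kw, gpu_kw * (1 + overhead_frac)

b200_gpu_kw, b200_node_kw = node_power_kw(8, 1000)  # 8.0 kW GPUs alone
b300_gpu_kw, _ = node_power_kw(8, 1400)             # 11.2 kW GPUs alone
print(f"8x B200: {b200_gpu_kw} kW GPUs, ~{b200_node_kw:.1f} kW node (assumed overhead)")
```

Size PDUs and cooling against the whole-node figure, not the GPU-only number, and verify the overhead against your actual chassis.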

Cooling: Air cooling is viable for B200 at 1,000W per GPU in well-ventilated high-density racks. Liquid cooling (direct liquid cooling or rear-door heat exchangers) is recommended for dense deployments and will extend hardware longevity. This contrasts with B300 at 1,400W, where liquid cooling is effectively mandatory.

NVLink: B200 uses NVLink 5 at 1.8 TB/s per GPU bidirectional, the same spec as B300, and 2x over H100's NVLink 4 (900 GB/s). In an 8-GPU configuration, this provides near-linear scaling for distributed inference and training with minimal communication overhead. Specific NVLink features include:

  • NVLink 5.0 with 1.8 TB/s per GPU bidirectional bandwidth
  • Full NVSwitch connectivity for 8-GPU systems
  • Unified memory addressing across all GPUs in the system
  • Direct GPU-to-GPU communication without CPU involvement
  • Sub-100ns GPU-to-GPU latency

Software compatibility: The B200 uses the same CUDA toolchain as B300 (CUDA 12.x, cuDNN 9.x). Any code running on H100 or H200 runs on B200 without modification. FP4 capabilities require TensorRT-LLM 0.17+ or vLLM with FP4 support enabled.

Getting Started with B200 on Spheron

Spheron has B200 GPUs available as both Spot and Dedicated instances. Spot pricing at $2.25/hr is the lowest B200 spot rate currently available, suitable for batch inference, development, and interruptible workloads. On-demand pricing is $6.03/hr as of March 18, 2026, and can fluctuate based on GPU availability. Dedicated instances guarantee availability for production inference services and long-running training jobs where interruption would be costly.

Access is through Spheron's console, with B200 available alongside H200, H100, B300, and other GPUs from multiple providers. No contracts, no minimum commitments for spot instances. Check GPU pricing for current rates and availability.

For teams currently running H100 workloads: the migration path is straightforward. Docker containers and Python environments built for H100 run on B200 without changes. FP4 optimization is an optional additional step, not a requirement to benefit from the B200's higher memory and FP8 throughput.

For GPU selection across models and use cases, see our complete guide to GPU requirements for LLMs and the GPU selection guide.
