
NVIDIA B200 Guide: Specs, Benchmarks, Cloud Pricing & H100 Upgrade

Written by Mitrasish, Co-founder · Mar 19, 2026
The B200 is the Blackwell GPU you can rent today: $2.25/hr spot (or $6.03/hr on-demand) on Spheron, roughly 2x the training throughput of an H100, 192 GB of HBM3e, and native FP4 support. If you're running 70B+ models or paying for H100 inference at scale, the upgrade math is straightforward. Start with B200 GPU rental on Spheron and read on for the full picture.

B200 Specs: The Full Blackwell Picture

Everything that matters about the B200 comes down to three headline figures: 192 GB memory, 8.0 TB/s bandwidth, and 9,000 TFLOPS FP4 (dense). Here's the complete spec table pulled directly from NVIDIA's B200 data sheet:

| Spec | Value |
|---|---|
| Architecture | NVIDIA Blackwell |
| VRAM | 192 GB HBM3e |
| Memory Bandwidth | 8.0 TB/s |
| Tensor Cores | 5th Generation |
| CUDA Cores | 16,896 |
| FP64 Performance | 37 TFLOPS |
| FP32 Performance | 75 TFLOPS |
| BF16/FP16 Performance | 2,250 TFLOPS (dense) |
| FP8 Performance | 4,500 TFLOPS (dense) |
| FP4 Performance | 9,000 TFLOPS (dense) |
| TDP | 1,000W |
| NVLink Bandwidth | 1.8 TB/s bidirectional |

What these numbers mean in practice:

192 GB VRAM puts a full 70B parameter model in FP16 on a single GPU. H100 maxes out at 80 GB, meaning 70B at FP16 requires tensor-parallel sharding across two GPUs. The B200 eliminates that complexity entirely. For 100B+ models, the B200 gets you further before you need multi-GPU coordination.

8.0 TB/s memory bandwidth is roughly 2.4x the H100's 3.35 TB/s. For large-batch inference, memory bandwidth is often the real bottleneck, not raw TFLOPS. At 8 TB/s, the B200 sustains high throughput even at batch sizes where the H100 starts to stall.

9,000 TFLOPS FP4 (dense) is a capability H100 simply does not have. The 5th-generation Tensor Cores introduce native FP4 support, doubling effective throughput compared to FP8 for models that can tolerate the precision reduction. See the FP4 quantization guide for a deep dive on quality tradeoffs.

1,000W TDP requires high-density power delivery. Not a drop-in swap into air-cooled infrastructure, but lower than the B300's 1,400W, and manageable in most modern datacenters with proper PDU planning.
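The VRAM claims above are easy to check with a back-of-envelope rule: weight memory ≈ parameter count × bytes per parameter. A minimal sketch, ignoring KV cache, activations, and framework overhead:

```python
def model_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: parameters (billions) x bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 1e9

# A 70B-parameter model at common precisions
for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"70B @ {name}: ~{model_vram_gb(70, bpp):.0f} GB")
# FP16 -> ~140 GB: needs two 80 GB H100s, fits one 192 GB B200
# FP8  -> ~70 GB:  barely fits an H100, little room for KV cache
# FP4  -> ~35 GB
```

This is a rough lower bound; real deployments also budget tens of GB for KV cache and runtime overhead, which is where the B200's extra headroom matters.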

B200 vs H100: When the Upgrade Is Worth It

Here's how the B200 stacks up against the H100 SXM5 across the specs that matter most for LLM workloads. For a deeper look at H100 variants and H200 comparisons, see our H100 vs H200 benchmark guide.

| Spec | B200 | H100 SXM |
|---|---|---|
| Architecture | Blackwell | Hopper |
| VRAM | 192 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 8.0 TB/s | 3.35 TB/s |
| FP4 Dense (TFLOPS) | 9,000 | N/A |
| FP8 Dense (TFLOPS) | 4,500 | ~1,979 |
| FP16 Dense (TFLOPS) | 2,250 | ~989 |
| TDP | 1,000W | 700W |
| NVLink Bandwidth | 1.8 TB/s | 900 GB/s |

All values are dense (non-sparse). H100 also publishes 2:4 structured-sparsity figures (~3,958 TFLOPS FP8 sparse) which require sparse weight patterns to activate.

Use B200 When:

Your models need more than 80 GB. Llama 3.1 70B at FP16 is ~140 GB. At FP8 it's ~70 GB, which barely fits an H100 with no room for KV cache at scale. B200's 192 GB handles this comfortably. For 100B+ models, B200 is often the only practical single-GPU option.

You're running high-traffic inference APIs. The 4,500 TFLOPS FP8 (dense) is roughly 2.3x the H100's 1,979 TFLOPS. More throughput per GPU means fewer GPUs to serve the same request volume.

FP4 inference is viable for your workload. At 9,000 TFLOPS FP4 (dense), the B200 doubles effective throughput compared to FP8. If your model tolerates FP4 quality (most production inference does), cost per token drops sharply.

Long context windows are required. At 128K context length on a 70B model, KV cache alone consumes 30-50 GB. An H100 with 80 GB total doesn't leave enough room for model weights plus KV cache at this scale. B200's 192 GB handles it.
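The 30-50 GB KV cache figure above can be derived from the model architecture: two tensors (K and V) per layer, per token. A sketch using the Llama 3.1 70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128 — assumed here for illustration):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total / 2**30

# Single 128K-token sequence, FP16 cache
print(kv_cache_gib(80, 8, 128, 128 * 1024))  # 40.0 GiB
```

At 40 GiB for a single 128K sequence, an 80 GB H100 holding even an FP8-quantized 70B model (~70 GB) has no room left, while a 192 GB B200 does.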

Stick With H100 When:

Your models fit in 80 GB. For 7B to 34B models, the H100 provides strong throughput at $1.49-2.99/hr. No reason to pay more for VRAM you won't use.

Training is the primary workload. For distributed training, the H100 is still the best cost-per-FLOP option. The B200's FP4 advantage is inference-specific; training workloads don't see the same proportional gains.

Budget is the constraint. Eight H100s at $2/hr totals $16/hr. Eight B200s at $6.03/hr on-demand totals $48.24/hr on Spheron, roughly 3x the cost of H100. For cost-sensitive training or inference workloads where 80 GB VRAM is sufficient, H100 remains more economical at on-demand rates.

The Cost-Per-Token Math

Cost per hour is the wrong metric. What matters is cost per output token, which combines throughput and price. Based on benchmarks (see the Real-World Benchmarks section below):

| GPU | Est. Throughput (Llama 2 70B) | Price/hr | Relative Cost-per-Token |
|---|---|---|---|
| H100 SXM | ~3,000 tok/s | $2.00 | 1.0x (baseline) |
| B200 (FP8, est.) | ~6,000 tok/s | $6.03 | ~1.51x (51% more expensive) |
| B200 (FP4, measured) | ~12,305 tok/s | $6.03 | ~0.74x (26% cheaper) |

At on-demand pricing ($6.03/hr for B200 vs $2.00/hr for H100), FP4 inference is about 26% cheaper per token. FP8 at on-demand rates is roughly 51% more expensive per token than H100. The B200's on-demand cost advantage only applies when using FP4. At spot pricing ($2.25/hr), the per-token advantage grows substantially. The FP4 throughput figure is measured from MLPerf Inference v5.0 server mode results (Llama 2 70B). The FP8 figure is estimated based on TFLOPS and memory bandwidth ratios relative to H100. Treat estimates as directional and run your own benchmarks before committing.
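The arithmetic behind the table reduces to one formula: hourly price divided by tokens per hour. A sketch using the throughput and price figures above:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per 1M output tokens at a given sustained throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

h100     = cost_per_million_tokens(2.00, 3000)    # ~$0.185 / 1M tokens
b200_fp8 = cost_per_million_tokens(6.03, 6000)    # ~$0.279 / 1M tokens
b200_fp4 = cost_per_million_tokens(6.03, 12305)   # ~$0.136 / 1M tokens
b200_fp4_spot = cost_per_million_tokens(2.25, 12305)  # ~$0.051 / 1M tokens

print(f"FP8 vs H100: {b200_fp8 / h100:.2f}x")  # ~1.51x
print(f"FP4 vs H100: {b200_fp4 / h100:.2f}x")  # ~0.74x
```

Plug in your own measured throughput and the rate you actually pay; the ratios shift quickly with spot pricing.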

B200 vs B300: What the Ultra Adds

The B300 (Blackwell Ultra) is the high-binned version of B200. Same base architecture, pushed further. For a full analysis of the B300 specifically, see our NVIDIA B300 Blackwell Ultra guide.

| Spec | B300 | B200 | H200 | H100 |
|---|---|---|---|---|
| Architecture | Blackwell Ultra | Blackwell | Hopper | Hopper |
| VRAM | 288 GB HBM3e | 192 GB HBM3e | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 8 TB/s | 8 TB/s | 4.8 TB/s | 3.35 TB/s |
| FP4 Dense (TFLOPS) | 14,000 | 9,000 | N/A | N/A |
| FP8 Dense (TFLOPS) | 7,000 | 4,500 | ~1,979 | ~1,979 |
| FP16 Dense (TFLOPS) | 3,500 | 2,250 | ~989 | ~989 |
| TDP | 1,400W | 1,000W | 700W | 700W |
| Interconnect | NVLink 5 (1.8 TB/s) | NVLink 5 (1.8 TB/s) | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) |
| Networking | ConnectX-8 (1.6T) | ConnectX-7 (800G) | ConnectX-7 (800G) | ConnectX-7 (800G) |

All values are dense (non-sparse).

The B300 adds 50% more VRAM (288 GB vs 192 GB), 55.6% more FP4 compute (14,000 vs 9,000 TFLOPS), 40% higher TDP (1,400W vs 1,000W), and doubles inter-node networking via ConnectX-8. The networking upgrade matters specifically for multi-node training where inter-GPU communication is the bottleneck.

B200 is the right call for most teams today. Supply is more readily available, spot pricing starts at $2.25/hr on Spheron vs B300 on-demand at $3.50/hr on Spheron (B300 spot pricing is also available), and the software stack is more mature. B300 is worth considering when a single GPU must hold a 70B+ model without quantization, or when FP4 throughput is the primary constraint and the extra 5,000 TFLOPS dense FP4 makes a measurable difference for your workload.

FP4 and FP8: The Blackwell Precision Advantage

The B200's 5th-generation Tensor Cores introduce two precision capabilities that H100 and H200 cannot match: native hardware-accelerated FP4 inference, and significantly higher FP8 throughput.

Here's what changes with Blackwell:

FP4 (native): The B200 delivers 9,000 TFLOPS FP4 dense. H100 has no native FP4 hardware support. For inference, FP4 doubles effective throughput compared to FP8 on the same GPU. With 4-bit weights, four times as many parameters fit per unit of memory bandwidth compared to FP16, which directly translates to higher batch throughput for memory-bandwidth-bound workloads.

FP8 at scale: B200's 4,500 TFLOPS FP8 dense is roughly 2.3x the H100's 1,979 TFLOPS. This improvement applies to any FP8 workload without requiring FP4 quantization.
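A rough way to see why precision and bandwidth dominate decode throughput: each generated token streams the full weight set from memory once, so single-stream decode is bounded by bandwidth ÷ model size. A back-of-envelope sketch, ignoring KV cache reads, batching, and kernel overheads:

```python
def decode_ceiling_tok_s(bandwidth_tbs: float, model_gb: float) -> float:
    """Upper bound on single-stream decode: every token reads all weights once."""
    return bandwidth_tbs * 1e12 / (model_gb * 1e9)

# 70B model on a B200 (8 TB/s) at different precisions
for name, gb in [("FP16", 140), ("FP8", 70), ("FP4", 35)]:
    print(f"{name}: ~{decode_ceiling_tok_s(8.0, gb):.0f} tok/s ceiling")
# FP16 -> ~57, FP8 -> ~114, FP4 -> ~229: halving precision doubles the ceiling
```

Real served throughput is far higher with batching, but the same proportionality holds: FP4 moves twice as many parameters per byte of bandwidth as FP8.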

Framework support for FP4:

  • TensorRT-LLM 0.17+ with native FP4 kernels
  • vLLM with VLLM_USE_FLASHINFER_MOE_FP4=1 environment variable
  • SGLang with FP4 quantization support
  • NVIDIA's published FP4 quantization toolkit for calibrating existing models

FP4 quality tradeoffs are model and task dependent. Most production inference tasks (classification, summarization, standard chat completion) show minimal degradation. Tasks requiring precise numerical outputs or fine-grained semantic distinctions may see measurable quality loss at FP4. Always validate quality on your specific use case before moving production traffic to FP4. See the FP4 quantization guide for benchmark data and a decision framework.

Real-World Benchmarks

These figures are sourced from MLPerf Training v4.1 and MLPerf Inference v5.0 results, plus NVIDIA published estimates where indicated:

| Benchmark | B200 Result | Comparison |
|---|---|---|
| GPT-3 (175B) Training | ~2x faster | vs H100 SXM5 (MLPerf Training v4.1) |
| LLM Inference Throughput | ~12,305 tokens/s | Llama 2 70B FP4, server mode (MLPerf Inference v5.0) |
| Mixture-of-Experts Inference | ~2.1x faster | vs H200 (MLPerf Inference v5.0, Mixtral 8x7B) |
| Multi-Modal Model Training | ~2.8x faster | vs H100 SXM5 (NVIDIA published estimate) |
| Stable Diffusion XL | ~1.6x faster | 1024x1024 inference vs H200 (server mode, MLPerf Inference v5.0) |
| Memory Capacity | 2.4x larger | vs H100 80GB |

The inference throughput figure (~12,305 tok/s for Llama 2 70B FP4 in server mode, per MLPerf Inference v5.0) is the key number for inference cost calculations. Using H100 as a ~3,000 tok/s baseline, the B200 delivers roughly 4x throughput at a small hourly premium on Spheron, which translates directly to lower cost per token.

For VRAM-constrained workloads, the 2.4x memory advantage (192 GB vs 80 GB) often enables workloads that would require multi-GPU H100 configurations to run on a single B200, reducing system complexity and inter-GPU communication overhead.

For more detailed analysis across GPU generations, see the GPU selection guide for LLMs.

Cloud Pricing Across Providers

B200 pricing as of March 18, 2026. Rates fluctuate with GPU availability.

| Provider | Price Per GPU/hr | Notes |
|---|---|---|
| Spheron (spot) | $2.25/hr | Lowest spot price available |
| Spheron (on-demand) | $6.03/hr | On-demand, no commitment |
| RunPod | $4.99/hr | Secure Cloud |
| Nebius | $5.50/hr | On-demand |
| Lambda Labs | from $4.99/hr | On-demand |
| Azure | est. $14.25/hr | On-demand |
| AWS (p6 instance) | est. $14.25/hr | On-demand |
| Google Cloud | est. $18.50/hr | On-demand |

Check current GPU pricing for live rates.

For interruptible workloads like batch inference, experimentation, and development, Spheron's spot pricing at $2.25/hr is the lowest available B200 rate. For production inference and long-running training where interruption is not acceptable, on-demand pricing applies. Spheron's on-demand at $6.03/hr is competitive with other providers, and spot at $2.25/hr vs $14.25/hr on AWS is still a 6x difference for the same GPU. At scale that gap compounds quickly: 100 on-demand GPUs running continuously for a month is $434,160 on Spheron vs over $1,026,000 on AWS. The hardware is identical. The difference is infrastructure overhead, margin structure, and the provider's cost to procure and operate B200 capacity.
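The fleet-cost comparison above is simple to reproduce: GPU count × hourly rate × hours in a month (720). A sketch using the rates quoted in this section:

```python
def monthly_cost(n_gpus: int, price_per_hr: float, hours: int = 720) -> float:
    """Monthly fleet cost for GPUs running continuously (30-day month)."""
    return n_gpus * price_per_hr * hours

print(f"Spheron on-demand: ${monthly_cost(100, 6.03):,.0f}")   # $434,160
print(f"AWS est. on-demand: ${monthly_cost(100, 14.25):,.0f}")  # $1,026,000
print(f"Spheron spot:      ${monthly_cost(100, 2.25):,.0f}")   # $162,000 (interruptible)
```

The spot figure assumes workloads that tolerate interruption; production inference should be costed at on-demand rates.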

B200 pricing will compress over the next 6-12 months as supply ramps. The H100 followed the same pattern: $8/hr at launch down to under $3/hr by 2026. Spot pricing captures the B200's performance advantage before on-demand rates normalize.

Availability and Infrastructure Requirements

Availability: B200 is more widely available than B300 as of Q1 2026. Spheron has B200 inventory available as both Spot and Dedicated instances. Spot pricing gives access at reduced rates for interruptible workloads (batch inference, experimentation, non-time-sensitive jobs). Dedicated instances guarantee availability for production inference and long-running training.

Power: At 1,000W per GPU, an 8-GPU B200 system draws 7-8 kW from GPUs alone. This is substantial but lower than B300's 1,400W per GPU (11.2 kW for 8 GPUs). Most modern high-density datacenters can handle B200 TDP with appropriate PDU configuration. Unlike B300, liquid cooling is not strictly mandatory, though it's recommended for sustained peak load operation.
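The power figures translate directly into rack-planning numbers. A sketch, where the non-GPU overhead fraction (CPUs, NICs, fans) is an illustrative assumption, not a measured value:

```python
def node_power_kw(n_gpus: int, gpu_tdp_w: int, overhead_frac: float = 0.35):
    """GPU draw in kW, plus an assumed node overhead fraction for CPUs/NICs/cooling."""
    gpu_kw = n_gpus * gpu_tdp_w / 1000
    return gpu_kw, gpu_kw * (1 + overhead_frac)

b200_gpu_kw, b200_node_kw = node_power_kw(8, 1000)  # 8.0 kW GPUs alone
b300_gpu_kw, _ = node_power_kw(8, 1400)             # 11.2 kW GPUs alone
print(f"8x B200: {b200_gpu_kw} kW GPUs, ~{b200_node_kw:.1f} kW node (assumed overhead)")
```

Size PDUs and cooling against the whole-node figure, not the GPU-only number, and verify the overhead against your actual chassis.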

Cooling: Air cooling is viable for B200 at 1,000W per GPU in well-ventilated high-density racks. Liquid cooling (direct liquid cooling or rear-door heat exchangers) is recommended for dense deployments and will extend hardware longevity. This contrasts with B300 at 1,400W, where liquid cooling is effectively mandatory.

NVLink: B200 uses NVLink 5 at 1.8 TB/s per GPU bidirectional, the same spec as B300, and 2x over H100's NVLink 4 (900 GB/s). In an 8-GPU configuration, this provides near-linear scaling for distributed inference and training with minimal communication overhead. Specific NVLink features include:

  • NVLink 5.0 with 1.8 TB/s per GPU bidirectional bandwidth
  • Full NVSwitch connectivity for 8-GPU systems
  • Unified memory addressing across all GPUs in the system
  • Direct GPU-to-GPU communication without CPU involvement
  • Sub-100ns GPU-to-GPU latency

Software compatibility: The B200 uses the same CUDA toolchain as B300 (CUDA 12.x, cuDNN 9.x). Any code running on H100 or H200 runs on B200 without modification. FP4 capabilities require TensorRT-LLM 0.17+ or vLLM with FP4 support enabled.

Getting Started with B200 on Spheron

Spheron has B200 GPUs available as both Spot and Dedicated instances. Spot pricing at $2.25/hr is the lowest B200 spot rate currently available, suitable for batch inference, development, and interruptible workloads. On-demand pricing is $6.03/hr as of March 18, 2026, and can fluctuate based on GPU availability. Dedicated instances guarantee availability for production inference services and long-running training jobs where interruption would be costly.

Access is through Spheron's console, with B200 available alongside H200, H100, B300, and other GPUs from multiple providers. No contracts, no minimum commitments for spot instances. Check GPU pricing for current rates and availability.

For teams currently running H100 workloads: the migration path is straightforward. Docker containers and Python environments built for H100 run on B200 without changes. FP4 optimization is an optional additional step, not a requirement to benefit from the B200's higher memory and FP8 throughput.

For GPU selection across models and use cases, see our complete guide to GPU requirements for LLMs and the GPU selection guide.
