Rent NVIDIA GH200 GPUs on Demand from $1.88/hr
Grace Hopper Superchip with 96GB HBM3 + 432GB LPDDR5X unified memory over NVLink-C2C.
You can rent an NVIDIA GH200 Grace Hopper Superchip on Spheron starting at $1.88 per GPU-hour on dedicated instances (99.99% SLA, non-interruptible), with spot pricing cheaper still. Per-minute billing, no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each module ships with 96GB HBM3 on the Hopper GPU plus 432GB LPDDR5X on the Grace ARM CPU, connected by 900 GB/s NVLink-C2C. That gives you ~528GB of cache-coherent unified memory in a single socket, eliminating the PCIe bottleneck for inference workloads with large KV caches, graph workloads with billion-edge datasets, and genomics pipelines that spill beyond GPU VRAM.
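A quick way to reason about whether a workload stays in HBM or uses the unified pool, as a minimal sketch (the 96GB/432GB capacities come from the specs on this page; the `placement` helper and the example sizes are illustrative):

```python
# Rough sizing check: does a model plus its working set fit in GH200 HBM3,
# or does it spill into the coherent LPDDR5X pool? Illustrative numbers only.

HBM3_GB = 96        # Hopper GPU memory
LPDDR5X_GB = 432    # Grace CPU memory, coherent over NVLink-C2C
UNIFIED_GB = HBM3_GB + LPDDR5X_GB  # ~528GB total on one module

def placement(weights_gb: float, working_set_gb: float) -> str:
    """Classify where a workload lands on a single GH200 module."""
    total = weights_gb + working_set_gb
    if total <= HBM3_GB:
        return "fits in HBM3"
    if total <= UNIFIED_GB:
        return "spills into unified LPDDR5X"
    return "exceeds a single GH200"

# Llama 3.1 70B at FP8 (~70GB weights) with a 40GB KV cache:
print(placement(70, 40))   # spills into unified LPDDR5X
```

Because NVLink-C2C is cache-coherent, the "spills" case still avoids explicit PCIe copies; the GPU addresses the LPDDR5X pool directly.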
Technical specifications
Pricing comparison
| Provider | Price/hr | Savings |
|---|---|---|
| Spheron (your price) | $1.88/hr | - |
| Lambda Labs | $1.99/hr | 1.1x more expensive |
| CoreWeave | $6.50/hr | 3.5x more expensive |
Need More GH200 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more GH200 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the GH200
Pick GH200 if
Your workload needs more memory than 96GB of HBM but doesn't justify B200/H200 rates, or your model spills KV cache into system memory and you need coherent access. It's also the sweet spot for graph neural networks, genomics pipelines, and recommendation models with huge embedding tables.
Pick H100 80GB instead if
Your model fits in 80GB HBM3 and you want maximum multi-GPU training throughput with NVLink + InfiniBand. H100 SXM5 is the standard for 8-way tensor parallelism and pre-training runs where CPU memory isn't in the critical path.
Pick H200 141GB instead if
You need more GPU-side HBM than 96GB, but don't need the unified memory architecture. H200 gives you 141GB HBM3e at 4.8 TB/s, a cleaner fit for 70B+ inference without going ARM.
Pick B200 192GB instead if
You need Blackwell FP4 Transformer Engine, 8 TB/s bandwidth, and the latest NVLink 5. B200 is the choice for 200B+ model training, and its dedicated HBM3e beats GH200's unified memory for bandwidth-bound workloads.
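The decision guide above condenses into a short sketch. The `pick_gpu` function below simply restates those rules of thumb in code; treat it as a starting point, not a policy:

```python
def pick_gpu(gpu_mem_needed_gb: float,
             needs_unified_memory: bool = False,
             needs_fp4_blackwell: bool = False) -> str:
    """Restates the pick-guide rules of thumb as a simple decision chain."""
    if needs_fp4_blackwell:
        return "B200 192GB"   # FP4 Transformer Engine, 8 TB/s HBM3e, NVLink 5
    if needs_unified_memory:
        return "GH200"        # KV cache or dataset spill needs coherent CPU memory
    if gpu_mem_needed_gb <= 80:
        return "H100 80GB"    # fits in HBM3; standard for multi-GPU training
    if gpu_mem_needed_gb <= 141:
        return "H200 141GB"   # more GPU-side HBM3e without going ARM
    return "GH200"            # beyond 141GB on one device: unified memory territory

print(pick_gpu(70))                             # H100 80GB
print(pick_gpu(120))                            # H200 141GB
print(pick_gpu(60, needs_unified_memory=True))  # GH200
```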
Ideal use cases
AI Inference & Serving
Leverage the 432GB LPDDR5X pool (528GB unified with HBM3) to serve large AI models with enormous KV caches, enabling high-throughput inference without CPU-GPU data transfer overhead.
Large Dataset Processing
Utilize the unified memory architecture (96GB HBM3 + 432GB LPDDR5X) to process datasets that don't fit in GPU VRAM alone, eliminating costly data transfers between CPU and GPU memory.
Scientific Computing & HPC
Combine the energy-efficient ARM Grace CPU with the powerful Hopper GPU for high-performance computing workloads.
Edge AI & Autonomous Systems
Deploy the compact superchip form factor for edge AI applications requiring powerful inference in a single integrated module.
Performance benchmarks
Serve Llama 3.1 70B with a massive KV cache on GH200
The GH200's 96GB HBM3 holds Llama 3.1 70B at FP8 (~70GB), and the 432GB LPDDR5X CPU memory over NVLink-C2C lets you extend the effective working set far beyond what a pure HBM card can hold.
```bash
# SSH into your GH200 instance (ARM64 / aarch64)
ssh ubuntu@<instance-ip>

# Install vLLM for ARM with CUDA 12.4+
pip install vllm

# Launch Llama 3.1 70B with FP8, long context
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --enforce-eager

# Sanity check
curl http://localhost:8000/v1/models
```

Most major ML frameworks (PyTorch, JAX, vLLM) have native ARM64 wheels. If you hit a package without an ARM build, NVIDIA's NGC containers cover the common cases.
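To see why the unified pool matters at long context, here's a rough KV-cache sizing calculation for Llama 3.1 70B, using its published model config (80 layers, 8 KV heads via GQA, head dim 128) and 1 byte per element at FP8:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: int = 1) -> float:
    """Rough KV cache size in GiB for a GQA transformer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1024**3

# Llama 3.1 70B, FP8 KV cache, one 32k-token sequence:
per_seq = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=32768)
print(f"{per_seq:.1f} GB per sequence")   # 5.0 GB per sequence
```

With ~70GB of FP8 weights in 96GB HBM3, roughly 26GB of HBM remains for cache, so only a handful of 32k sequences fit GPU-side; larger batches extend into the 432GB LPDDR5X pool over NVLink-C2C instead of hitting a hard OOM wall.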
NVLink-C2C Configuration
The GH200 Grace Hopper Superchip features NVLink-C2C (Chip-to-Chip) interconnect providing 900 GB/s bidirectional coherent bandwidth between the Grace CPU and Hopper GPU, eliminating the traditional PCIe bottleneck and enabling seamless unified memory access across the entire module.
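To put 900 GB/s in perspective, a back-of-envelope comparison against a PCIe Gen5 x16 link (~128 GB/s bidirectional, the basis for the ~7x figure); the 70GB transfer size is illustrative:

```python
NVLINK_C2C_GBPS = 900   # GH200 CPU<->GPU coherent bandwidth
PCIE_GEN5_GBPS = 128    # approx. x16 Gen5 bidirectional bandwidth

def transfer_seconds(gb: float, bandwidth_gbps: float) -> float:
    """Idealized time to move `gb` gigabytes at the given bandwidth."""
    return gb / bandwidth_gbps

working_set_gb = 70  # e.g. a 70GB FP8 model's worth of data
print(f"PCIe Gen5:  {transfer_seconds(working_set_gb, PCIE_GEN5_GBPS):.2f} s")
print(f"NVLink-C2C: {transfer_seconds(working_set_gb, NVLINK_C2C_GBPS):.2f} s")
print(f"Speedup:    {NVLINK_C2C_GBPS / PCIE_GEN5_GBPS:.1f}x")  # 7.0x
```

Real transfers won't hit peak bandwidth, but the ratio is what matters: on GH200 the GPU also addresses CPU memory coherently, so many of these copies disappear entirely.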
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
Related resources
NVIDIA GH200 Grace Hopper Superchip: Architecture and Performance Guide
Deep dive into GH200 architecture, unified memory, ARM-based Grace CPU, and ideal use cases.
Best NVIDIA GPUs for LLMs: Complete Ranking Guide
How the GH200 ranks against H100, H200, and A100 for large language model workloads.
GPU Memory Requirements for LLMs: VRAM Calculator and Sizing Guide
Calculate exactly how much VRAM you need, and why GH200's 96GB + 432GB unified memory matters.
Frequently asked questions
What makes GH200 different from H100?
The GH200 Grace Hopper Superchip integrates an ARM-based Grace CPU and a Hopper GPU into a single unified architecture connected via NVLink-C2C. Unlike H100 which relies on PCIe for CPU-GPU communication, GH200 provides 900 GB/s coherent interconnect bandwidth and 432GB of shared LPDDR5X memory accessible by both CPU and GPU. This makes GH200 ideal for workloads where data doesn't fit in GPU VRAM alone.
What is NVLink-C2C?
NVLink-C2C (Chip-to-Chip) is NVIDIA's high-bandwidth coherent interconnect that connects the Grace CPU and Hopper GPU within the GH200 module. It provides 900 GB/s bidirectional bandwidth, which is 7x faster than PCIe Gen5. The coherent nature means both CPU and GPU can access each other's memory seamlessly with hardware-managed cache coherency, eliminating the traditional PCIe bottleneck.
Is GH200 good for LLM inference?
Yes, the GH200 is excellent for LLM inference. With 96GB of HBM3 GPU memory plus 432GB of LPDDR5X CPU memory accessible via NVLink-C2C, you can maintain massive KV caches for large context windows. The unified memory architecture allows models to seamlessly spill over from GPU to CPU memory without the PCIe bottleneck, making it ideal for serving large language models with long context lengths.
What workloads benefit from unified memory?
Workloads that benefit most from GH200's unified memory are those where data doesn't fit in GPU VRAM alone. This includes large graph neural networks with billion-edge graphs, genomics pipelines processing entire genomes, recommendation models with huge embedding tables, scientific simulations with large state spaces, and any AI workload that traditionally requires expensive CPU-GPU data transfers.
How does the ARM CPU affect compatibility?
The Grace CPU uses ARM Neoverse V2 architecture. Most major ML frameworks including PyTorch, TensorFlow, and JAX have full ARM support and run natively. CUDA code runs on the Hopper GPU unchanged. Some CPU-dependent tools compiled for x86 may need recompilation for ARM, but NVIDIA provides optimized ARM containers and libraries. The vast majority of AI workloads run seamlessly on GH200.
Can I use GH200 for training?
Yes, the GH200 contains the same Hopper GPU architecture as the H100 with 96GB HBM3 memory. It's particularly well-suited for training models that require large memory, such as models with massive embedding tables or long sequences. However, for pure multi-GPU training throughput where InfiniBand scaling is critical, H100 with InfiniBand networking may be more cost-effective.
What's the minimum rental period?
There's no minimum! Spheron charges by the hour with per-minute billing granularity. Rent a GH200 for just an hour to test your workload, or keep it running for months. You only pay for what you use with no long-term contracts or commitments.
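Per-minute billing means a short run is priced by the minute rather than rounded up to the hour. A quick sketch using the $1.88/hr dedicated rate quoted above:

```python
RATE_PER_HOUR = 1.88  # GH200 dedicated on-demand rate

def cost_usd(minutes: int, rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Per-minute billing: pay for exactly the minutes used."""
    return round(minutes / 60 * rate_per_hour, 2)

print(cost_usd(90))   # 90 minutes -> 2.82
print(cost_usd(45))   # 45 minutes -> 1.41
```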
How does GH200 compare on price-performance?
GH200 is strongest when your workload actually uses the unified memory pool. For memory-bound LLM serving with very long contexts, graph neural networks on billion-edge datasets, or genomics pipelines with 100GB+ intermediate buffers, the combined 528GB CPU+GPU memory eliminates PCIe data copies and is often faster per dollar than stacking multiple H100s. For workloads that fit entirely in 80GB HBM, H100 SXM5 is cheaper per hour.
Which regions is GH200 available in?
GH200 GPUs are currently available in US, Europe, and Canada regions. We're continuously expanding capacity and regions. Check the Spheron app for specific availability or contact our team for region-specific requirements.
Do you offer support?
Our platform is plug-and-play for standard deployments. For 100+ GPU clusters, you get dedicated support via Slack or Discord, plus sourcing assistance. Enterprise customers get dedicated support channels and SLA guarantees.
Book a call with our team →

What's the difference between dedicated and spot GH200 instances?
Dedicated GH200 instances are non-interruptible, run on a 99.99% SLA, and bill per-minute at the on-demand rate. Spot instances run on spare capacity at meaningfully lower rates but can be preempted when dedicated demand rises. Use spot for fault-tolerant workloads: batch inference, QLoRA fine-tuning with checkpointing every 15-30 minutes, or graph analytics jobs. Use dedicated for customer-facing inference endpoints or any job where an interruption would cost more than the savings. Both tiers live in the same control plane, so you can mix them across a project.