Spheron GPU Catalog

Rent NVIDIA A100 80GB GPUs on Demand from $0.45/hr

80GB HBM2e, NVLink 600 GB/s, MIG, per-minute billing. Live in under 2 minutes.

At a glance

Renting an NVIDIA A100 80GB on Spheron starts at $0.45 per GPU per hour on dedicated (99.99% SLA), with interruptible spot instances cheaper still. There is no minimum commit, billing is per minute, and most instances are live inside two minutes. The A100 has 80GB of HBM2e and 2.0 TB/s of memory bandwidth, enough to train or fine-tune models up to about 30B parameters on a single card and serve quantized 70B models at production latency. SXM variants add 600 GB/s NVLink between GPUs for multi-GPU training. Hyperscaler on-demand A100 80GB pricing runs roughly $3.43 per GPU per hour on AWS p4de, $4.10 on Azure ND A100 v4, and about $5.07 on GCP a2-ultragpu.

GPU Architecture: NVIDIA Ampere
VRAM: 80 GB HBM2e
Memory Bandwidth: 2.0 TB/s

Technical specifications

GPU Architecture: NVIDIA Ampere
VRAM: 80 GB HBM2e
Memory Bandwidth: 2.0 TB/s
Tensor Cores: 432 (3rd Gen)
CUDA Cores: 6,912
FP64 Performance: 9.7 TFLOPS
FP32 Performance: 19.5 TFLOPS
TF32 Performance: 156 TFLOPS
FP16 Performance: 312 TFLOPS
INT8 Performance: 624 TOPS
NVLink Bandwidth: 600 GB/s (SXM)
MIG Instances: Up to 7 per GPU
System RAM: 100 GB DDR4
vCPUs: 14 vCPUs
Storage: 625 GB NVMe SSD
Form Factor: SXM4 / PCIe Gen4
TDP: 400W SXM / 300W PCIe

Pricing comparison

Provider | Price/hr | Cost vs Spheron
Spheron (your price) | $0.45/hr | -
Jarvislabs | $1.49/hr | 3.3x more expensive
TensorDock | $1.57/hr | 3.5x more expensive
Lambda Labs | $2.49/hr | 5.5x more expensive
AWS p4de | $3.43/hr | 7.6x more expensive
Azure ND A100 v4 | $4.10/hr | 9.1x more expensive
Google Cloud | $5.07/hr | 11.3x more expensive

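
The multiples in the table follow from dividing each provider's hourly price by the Spheron rate. A minimal sketch of that arithmetic, using only the prices listed above (the function name `cost_multiples` is illustrative, not part of any Spheron tooling):

```python
# Hourly A100 80GB prices per GPU, copied from the comparison table.
PRICES_PER_HOUR = {
    "Spheron": 0.45,
    "Jarvislabs": 1.49,
    "TensorDock": 1.57,
    "Lambda Labs": 2.49,
    "AWS p4de": 3.43,
    "Azure ND A100 v4": 4.10,
    "Google Cloud": 5.07,
}


def cost_multiples(prices, baseline="Spheron"):
    """Return each provider's price as a multiple of the baseline, to 1 decimal."""
    base = prices[baseline]
    return {
        name: round(price / base, 1)
        for name, price in prices.items()
        if name != baseline
    }
```

Calling `cost_multiples(PRICES_PER_HOUR)` reproduces the right-hand column, so the same helper can be reused if any of the listed prices change.
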
Custom & Reserved

Need More A100 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more A100 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the A100

Scenario 01

Pick the A100 if

You are training or fine-tuning a 7B to 30B parameter model, serving a quantized 70B model, or running classic workloads like BERT, ResNet, recommender systems, and RAPIDS analytics. The A100 is also the right call when you want the most mature ML stack on the market and are happy trading a bit of FP8 throughput for 40 to 60 percent lower hourly cost than H100.

Recommended fit
Scenario 02

Pick the H100 instead if

Your workload is FP8-native (Llama 3 / DeepSeek inference, FP8 training runs) or you need Transformer Engine speedups. H100 is roughly 2.5 to 3x faster on Tensor Core math and has about 1.7x the memory bandwidth, but it costs about 2x as much. If the speedup pays for itself, make the jump.

Scenario 03

Pick the L40S instead if

You are running pure inference on sub-30B models, or batch image and video generation. L40S has 48GB GDDR6 and a much lower hourly cost, with strong FP8 and Ada Lovelace Tensor Cores. It has no NVLink, so it is not the right pick for multi-GPU training.

Scenario 04

Pick the RTX 4090 instead if

You are doing development, small-scale fine-tuning, or sub-13B inference on a budget. The 4090 has 24GB VRAM and no NVLink, but it is the cheapest way to run modern AI stacks. Step up to A100 once you need more memory or multi-GPU scaling.


Ideal use cases

Use case / 01
🤖

LLM training and fine-tuning

Train or fine-tune models in the 7B to 30B range with mixed precision. FSDP and DeepSpeed ZeRO scale cleanly across 8x A100 with NVLink, and QLoRA, which keeps the base weights in 4-bit, brings 70B within reach on a single card.

Continued pre-training on Llama 3.1 8B / Mistral 7B
Supervised fine-tunes on Qwen 14B, CodeLlama 13B
LoRA and QLoRA fine-tunes of Llama 2/3 70B
Multi-GPU ZeRO-3 training up to 30B parameters

Use case / 02

Production LLM inference

Serve models at steady latency with vLLM, TensorRT-LLM, or Triton. INT8 and FP16 paths are well optimized, and MIG lets you carve one A100 into up to 7 isolated inference slots.

Llama 3.1 8B / Mistral 7B at high concurrency
Quantized Llama 2 70B (INT4) serving on single A100
Multi-model serving via MIG partitioning
BERT / T5 embedding and reranker pipelines

Use case / 03
🎯

Classic ML and computer vision

The A100 still holds the line on computer vision and recommender workloads that predate the LLM wave. Mature CUDA kernels, stable ecosystem, predictable throughput.

ResNet, EfficientNet, ViT, DETR training
Recommender systems (DLRM, two-tower)
Speech recognition and TTS pipelines
AutoML, NAS, and hyperparameter sweeps

Use case / 04
📊

GPU data analytics and HPC

RAPIDS, cuDF, cuGraph, and GPU-accelerated SQL engines all target A100 first. FP64 throughput is 9.7 TFLOPS, enough for most simulation work that does not need Hopper-class double precision.

ETL and feature engineering with cuDF
Large-scale graph analytics with cuGraph
Molecular dynamics and bioinformatics
Signal processing and time-series analytics

Performance benchmarks

BERT time-to-solution (TF32): up to 5x faster (vs V100 FP32)
TF32 cross-network speedup: ~2.6x avg (23 networks vs V100 FP32)
Llama 2 70B inference (INT4): fits single A100 80GB (~35 GB weights)
FP16 Tensor throughput: 312 TFLOPS (624 TFLOPS with sparsity)
TF32 Tensor throughput: 156 TFLOPS (~10x V100 FP32 at 15.7 TFLOPS)
Memory bandwidth: 2.0 TB/s (vs 1.55 TB/s on A100 40GB)

Serve Llama 3.1 8B on an A100 in under 2 minutes

Spin up a Spheron A100 80GB, pull the vLLM image, and serve Llama 3.1 8B with an OpenAI-compatible API. Point any OpenAI SDK client at the endpoint and you are done.

```bash
# 1. Provision an A100 80GB from the Spheron CLI (or use the dashboard)
spheron deploy --gpu a100-80gb --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3.1 8B Instruct
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize MIG partitioning on A100."}]
  }'
```
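
The same request can be made from Python. This is a minimal sketch using only the standard library rather than an OpenAI SDK, so it runs without extra packages. `BASE_URL` is a placeholder for your own instance address, and the helper names `build_chat_request` and `ask` are illustrative; the route and payload mirror the curl call above.

```python
import json
import urllib.request

# Placeholder (assumption): replace with http://<instance-ip>:8000 for your instance.
BASE_URL = "http://127.0.0.1:8000"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the /v1/chat/completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server from step 2 running, `ask("Summarize MIG partitioning on A100.")` should return the model's answer as a string.
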

For 70B inference, add --tensor-parallel-size 2 and rent 2x A100 80GB with NVLink. For multi-node training, contact us for InfiniBand-connected clusters.

Interconnect fabric

Multi-GPU A100 with NVLink and InfiniBand

A100 SXM4 nodes on Spheron link 8 GPUs with NVLink at 600 GB/s intra-node, and multi-node jobs use 200 Gb/s HDR InfiniBand with GPUDirect RDMA. That is the same fabric NVIDIA ships in DGX A100 systems, so PyTorch DDP, DeepSpeed ZeRO, and Megatron-LM run at close to linear scaling.

01. 600 GB/s NVLink between GPUs inside a node
02. 200 Gb/s HDR InfiniBand across nodes
03. GPUDirect RDMA for zero-copy GPU-to-GPU transfers
04. NCCL pre-tuned for A100 topology
05. MIG support for splitting into up to 7 instances per GPU
06. 8x A100 per node, multi-node clusters on request
07. Tested with PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM
08. Both SXM4 and PCIe Gen4 form factors available

Scale

Need a custom multi-node cluster or reserved capacity?

Frequently asked questions

How much does it cost to rent an A100 GPU?

On Spheron the A100 80GB starts at $0.45 per GPU per hour on dedicated (99.99% SLA, non-interruptible), with interruptible spot instances cheaper still. There is no minimum commit and billing is per minute. For reference, Lambda Labs runs ~$2.49/hr, AWS p4de ~$3.43/hr per GPU, Azure ND A100 v4 ~$4.10/hr per GPU, and Google Cloud a2-ultragpu around $5/hr.

What is the cheapest way to rent an A100?

Spot instances on Spheron are the cheapest path, often 50 to 70 percent below the dedicated rate. The trade-off is that the instance can be reclaimed when demand spikes, so checkpoint every 15 to 30 minutes and treat spot as a fit for fault-tolerant training, batch jobs, and experimentation. For steady production serving, stay on dedicated (99.99% SLA, non-interruptible). Both are on-demand tiers with per-minute billing.
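
One way to follow the checkpointing advice is a time-based save with an atomic file swap, so a reclaim in the middle of a write never leaves a half-written checkpoint. The sketch below is framework-agnostic and makes several assumptions: the state is a small JSON-serializable dict, the step counter stands in for a real training step, and the names `save_checkpoint`, `load_checkpoint`, and `train` are illustrative. In a real job you would swap in your framework's own save and load calls.

```python
import json
import os
import tempfile
import time

CHECKPOINT_EVERY_S = 15 * 60  # lower end of the 15 to 30 minute guidance


def save_checkpoint(state, path):
    """Write state to a temp file, then atomically rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(total_steps, path, interval_s=CHECKPOINT_EVERY_S, clock=time.monotonic):
    """Resume from the last checkpoint and save again every interval_s seconds."""
    state = load_checkpoint(path)
    last_save = clock()
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training step
        if clock() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = clock()
    save_checkpoint(state, path)  # final save at the end of the run
    return state
```

If the instance is reclaimed, rerunning `train` with the same path picks up from the last saved step instead of starting over.
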

Can I rent an A100 by the hour?

Yes. Spheron bills per minute with no minimum. A one-hour benchmark costs you one hour. No contracts, no reserved-instance lock-in on dedicated or spot, and no commit fees.
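
As a rough illustration of per-minute billing, the sketch below estimates a charge from runtime and GPU count. It assumes the $0.45 dedicated starting rate from this page and assumes partial minutes round up to the next whole minute. That rounding rule is a guess for illustration, not a documented Spheron policy, and `billed_cost` is a hypothetical helper.

```python
import math

RATE_PER_GPU_HOUR = 0.45  # dedicated starting rate quoted on this page


def billed_cost(runtime_seconds, gpus=1, rate_per_gpu_hour=RATE_PER_GPU_HOUR):
    """Estimate the charge in dollars, rounding partial minutes up (assumption)."""
    billed_minutes = math.ceil(runtime_seconds / 60)
    return round(billed_minutes * gpus * rate_per_gpu_hour / 60, 4)
```

Under these assumptions a one-hour single-GPU benchmark comes to `billed_cost(3600)`, which is 0.45, and a 90-minute run on 8 GPUs comes to `billed_cost(90 * 60, gpus=8)`, which is 5.4.
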

How fast can I deploy an A100 instance?

Most A100 instances are live in 45 to 90 seconds. Hardware is pre-warmed, so provisioning behaves more like a container start than a VM boot. If your Docker image is ready, you can be running a training script inside two minutes of hitting deploy.

What is the difference between A100 SXM and A100 PCIe?

SXM4 is the higher-power variant (400W) with NVLink between GPUs at 600 GB/s, which matters for multi-GPU training and model parallelism. PCIe is lower-power (300W) and easier to mix with standard servers, but it lacks the all-to-all NVLink fabric of SXM nodes; NVLink on PCIe cards is limited to an optional bridge between pairs of GPUs. Pick SXM for distributed training or 70B FP16 inference across 2+ GPUs. Pick PCIe for single-GPU inference or data processing.

What is the difference between A100 40GB and 80GB?

The 80GB variant doubles VRAM and bumps memory bandwidth from 1.55 TB/s to 2.0 TB/s. That matters for larger batch sizes, long-context inference, and 70B-class quantized models. Spheron defaults to the 80GB SKU because the memory headroom usually pays for itself.

Does A100 support Multi-Instance GPU (MIG)?

Yes. A single A100 splits into up to 7 isolated MIG instances, each with dedicated compute, memory, and bandwidth. MIG is perfect for running multiple small inference workloads on one card without noisy-neighbor effects. It is exposed on both SXM and PCIe variants.

Do you support multi-node A100 clusters with InfiniBand?

Yes. Spheron offers 8x A100 per node with NVLink, and multi-node clusters connected by 200 Gb/s HDR InfiniBand with GPUDirect RDMA. Clusters are tested with PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM. Larger configurations are available on request.

What regions are A100s available in?

A100 capacity is online across North America, Europe, and Asia, sourced from data center partners. Availability shifts with demand and the dashboard shows live capacity per region.

What frameworks and drivers come pre-installed?

PyTorch, TensorFlow, JAX, and the major serving stacks (vLLM, TensorRT-LLM, Triton, SGLang) all ship in the default images. CUDA 12.6+, cuDNN, NCCL, and RAPIDS are pre-tuned for A100. You can also bring your own Docker image.

Can A100 handle 70B-parameter models?

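For inference, yes. A 70B model in INT4 (~35GB of weights) runs on a single A100 80GB. At INT8 (~70GB) the weights leave almost no room for the KV cache on one card, so plan on two A100 80GBs with tensor parallelism. FP16 inference at 70B (~140GB) needs 2+ A100 80GB with NVLink, and FP16 training needs more GPUs again once gradients and optimizer state are counted. The sweet spot for the A100 remains 7B to 30B parameters.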
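
The sizing above is weights-only arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring KV cache, activations, and runtime overhead (the helper names are illustrative):

```python
# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
A100_VRAM_GB = 80.0


def weight_gb(params_billion, precision):
    """Approximate size of the model weights in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def free_gb_per_gpu(params_billion, precision, gpus=1):
    """VRAM left on each GPU after an even split of the weights.

    A negative result means the weights alone do not fit.
    """
    return A100_VRAM_GB - weight_gb(params_billion, precision) / gpus
```

For a 70B model this gives 35 GB of weights at INT4 (45 GB free on one card), 70 GB at INT8, and 140 GB at FP16, which leaves only about 10 GB per card when split across two GPUs. Whatever remains has to hold the KV cache and runtime overhead, so treat these figures as a lower bound on what you need.
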

Is the A100 worth it over the H100?

If your workload is FP8-native or memory-bandwidth-bound, H100 pays for itself. If you are doing classic training or fine-tuning up to 30B parameters, or inference on models that fit in 80GB without FP8, A100 usually wins on dollars per token. Start on A100, move to H100 when the speedup justifies the cost.
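
Whether the speedup justifies the cost comes down to one ratio: the price multiple divided by the speedup. The sketch below is an illustration, not a benchmark; the example inputs are the rough figures quoted elsewhere on this page (H100 at about 2x the price and 2.5 to 3x faster on Tensor Core math), and your measured speedup may differ.

```python
def relative_cost_per_job(price_ratio, speedup):
    """Cost of a job on the faster GPU relative to the slower one.

    price_ratio: faster GPU's hourly price divided by the slower GPU's.
    speedup: how many times faster the job runs on the faster GPU.
    A result below 1.0 means the faster GPU is cheaper per job.
    """
    if price_ratio <= 0 or speedup <= 0:
        raise ValueError("price_ratio and speedup must be positive")
    return price_ratio / speedup
```

At 2x the price and a 2.5x speedup, `relative_cost_per_job(2.0, 2.5)` is 0.8, so the H100 would be cheaper per job. If your workload only sees a 1.5x speedup, the ratio rises above 1.0 and the A100 wins on cost.
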

Do you offer enterprise SLAs and dedicated support for A100?

For 100+ GPU deployments and production-critical workloads, Spheron offers dedicated Slack or Discord support, sourcing assistance, and SLA-backed instances. Smaller deployments are self-serve through the dashboard.

How does A100 pricing on Spheron compare to AWS, GCP, and Azure?

For the same A100 80GB hardware, Spheron is meaningfully cheaper than AWS p4de, Azure ND A100 v4, and GCP a2-ultragpu on-demand. As of April 2026, hyperscaler on-demand A100 80GB pricing runs roughly $3.43/hr per GPU on AWS p4de, $4.10/hr on Azure ND A100 v4, and about $5/hr on GCP. Spheron starts at $0.45/hr. Same silicon, different pricing model.
