Engineering

NVIDIA A100 Deployment Guide: SXM vs PCIe, Spot vs Dedicated, and MIG

Written by Mitrasish, Co-founder · Feb 3, 2026
NVIDIA A100 · GPU Deployment · MIG · NVLink · SXM vs PCIe · Ampere Architecture

If you are picking an A100 configuration for a real workload, the hard decisions are not "do I need an A100." They are SXM or PCIe, spot or dedicated, MIG or whole-GPU, and how many cards per node. This guide walks through each of those trade-offs with the numbers and failure modes we see most often in production.

For live A100 pricing and deployment, see the A100 rental page. For a broader architectural comparison, start with our A100 vs V100 analysis or the best NVIDIA GPUs for LLMs framework.

Why the A100 is Still the Default in 2026

The A100 was NVIDIA's first GPU designed specifically for large-scale AI, not just graphics or HPC with AI added on top. Built on the Ampere architecture, it introduced features that reshaped how AI workloads run in production: high-bandwidth HBM2e memory, third-generation Tensor Cores, and Multi-Instance GPU support.

Most importantly, the software ecosystem around A100 is mature. Frameworks like PyTorch, TensorFlow, JAX, TensorRT, and RAPIDS have been tuned for years on this hardware. Engineers know how A100 behaves under sustained load. That reliability still matters more than raw benchmarks.

For many teams, A100 is the most cost-effective way to run serious AI workloads without paying a premium for bleeding-edge hardware they may not fully utilize.

A100 Technical Specifications

| Specification | A100 80GB SXM | A100 40GB SXM | A100 80GB PCIe |
| --- | --- | --- | --- |
| Architecture | Ampere (7nm) | Ampere (7nm) | Ampere (7nm) |
| CUDA Cores | 6,912 | 6,912 | 6,912 |
| Tensor Cores | 432 (3rd Gen) | 432 (3rd Gen) | 432 (3rd Gen) |
| VRAM | 80 GB HBM2e | 40 GB HBM2e | 80 GB HBM2e |
| Memory Bandwidth | 2,039 GB/s | 1,555 GB/s | 2,039 GB/s |
| FP32 (TFLOPS) | 19.5 | 19.5 | 19.5 |
| TF32 Tensor (TFLOPS) | 156 | 156 | 156 |
| FP16 Tensor (TFLOPS) | 312 | 312 | 312 |
| FP16 Sparsity (TFLOPS) | 624 | 624 | 624 |
| INT8 Tensor (TOPS) | 624 | 624 | 624 |
| NVLink Bandwidth | 600 GB/s | 600 GB/s | N/A |
| PCIe | Gen 4 | Gen 4 | Gen 4 |
| MIG Instances | Up to 7 | Up to 7 | Up to 7 |
| TDP | 400W | 400W | 300W |

The A100 delivers 312 TFLOPS of FP16 Tensor performance, with up to 624 TFLOPS when structural sparsity is enabled. TF32 is particularly important; it allows developers to run FP32 models with much higher performance without rewriting code, delivering major speedups while preserving accuracy. This is one of the reasons A100 saw rapid adoption across existing AI codebases.
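The "no code changes" claim comes from TF32's format: it keeps FP32's 8-bit exponent (so the dynamic range is unchanged) but carries only 10 mantissa bits, roughly 3 decimal digits of precision. As a rough illustration, assuming simple truncation rather than the hardware's actual rounding, the precision loss can be simulated in pure Python:

```python
import math
import struct

def tf32_round(x: float) -> float:
    """Simulate TF32's reduced precision: keep FP32's 8-bit exponent but
    only the top 10 of the 23 mantissa bits (truncation, for illustration)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # zero the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Powers of two survive exactly; other values keep ~3 decimal digits.
print(tf32_round(1.0))                                  # 1.0 exactly
rel_err = abs(tf32_round(math.pi) - math.pi) / math.pi
print(rel_err < 2 ** -10)                               # error below 1/1024
```

That bounded error is why most FP32 training runs tolerate TF32 matmuls without accuracy loss, while the narrower mantissa lets the Tensor Cores run 8x faster than classic FP32.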

Key Architecture Features

Ampere Tensor Cores

At the heart of A100 are NVIDIA's third-generation Tensor Cores, which accelerate matrix operations that dominate AI training and inference. A100 supports multiple numerical formats: FP32, TF32, FP16, BF16, INT8, and INT4, allowing the same GPU to handle training, fine-tuning, and inference efficiently.

HBM2e Memory and Bandwidth

The A100 80GB uses HBM2e memory delivering nearly 2 TB/s of bandwidth. Many AI workloads are memory-bound, not compute-bound; large models, large batch sizes, and long context windows all benefit directly from higher memory capacity and bandwidth. With A100 80GB, more of the model and data stays resident on the GPU instead of spilling into system memory.
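The memory-bound point has a concrete consequence for inference: single-stream autoregressive decoding must stream essentially all the weights once per generated token, so memory bandwidth sets a hard throughput ceiling regardless of TFLOPS. A back-of-envelope sketch (idealized, ignoring KV-cache reads and kernel overhead):

```python
def decode_ceiling_tokens_per_s(params_b: float, bytes_per_param: int,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput when every generated
    token requires reading the full weight set once (memory-bound regime)."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A100 80GB at 2,039 GB/s, 7B model in FP16 (2 bytes/param):
print(round(decode_ceiling_tokens_per_s(7, 2, 2039)))   # ~146 tokens/s ceiling
# The 40GB SXM's 1,555 GB/s caps the same model ~24 percent lower.
print(round(decode_ceiling_tokens_per_s(7, 2, 1555)))
```

Batching amortizes the weight reads across requests, which is why serving throughput scales far better than single-stream latency.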

NVLink Interconnect

A100 SXM variants support NVLink, NVIDIA's high-speed GPU-to-GPU interconnect. NVLink allows multiple GPUs to communicate at up to 600 GB/s, far faster than PCIe alone. This is critical for multi-GPU training, model parallelism, and large inference clusters. When GPUs exchange gradients or activations quickly, scaling efficiency improves and training time drops.
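To put numbers on the gradient-exchange cost: an idealized ring all-reduce moves 2(N-1)/N of the payload over each GPU's link. The sketch below compares NVLink at 600 GB/s against an assumed ~32 GB/s for PCIe Gen4 x16, ignoring latency and compute overlap:

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int,
                           link_gb_s: float) -> float:
    """Idealized ring all-reduce time: each GPU sends and receives
    2*(N-1)/N of the payload over its link (latency ignored)."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_s

# Syncing 14 GB of FP16 gradients (a 7B model) across 8 GPUs:
nvlink = ring_allreduce_seconds(14, 8, 600)   # SXM over NVLink
pcie   = ring_allreduce_seconds(14, 8, 32)    # assumed PCIe Gen4 x16 rate
print(f"NVLink: {nvlink * 1e3:.0f} ms, PCIe: {pcie * 1e3:.0f} ms")
```

Roughly 41 ms versus 766 ms per synchronization step. Paid once per training step, that gap is the difference between near-linear scaling and GPUs idling on communication.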

Multi-Instance GPU (MIG)

MIG allows a single A100 GPU to be split into up to seven isolated GPU instances, each with its own memory, compute, and bandwidth allocation. These instances behave like independent GPUs. This is extremely useful for inference workloads, shared environments, and teams running multiple smaller jobs; instead of underutilizing a full GPU, MIG lets teams pack workloads efficiently while maintaining isolation.
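On the A100 80GB, MIG profiles consume fixed shares of the GPU's 7 compute slices: 1g.10gb takes one slice, 2g.20gb two, 3g.40gb three, and so on (instances are created with `nvidia-smi mig -cgi`). As a hypothetical capacity-planning helper, a first-pass check is that the requested slices sum to at most 7; note the real driver enforces somewhat stricter placement rules than this simple sum:

```python
# Compute slices consumed by each A100 80GB MIG profile.
MIG_SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3,
              "4g.40gb": 4, "7g.80gb": 7}

def partition_fits(profiles: list[str], total_slices: int = 7) -> bool:
    """First-pass check that a requested mix of MIG profiles can fit on
    one A100 (slice budget only; driver placement rules are stricter)."""
    return sum(MIG_SLICES[p] for p in profiles) <= total_slices

print(partition_fits(["3g.40gb", "2g.20gb", "2g.20gb"]))  # True: 3+2+2 = 7
print(partition_fits(["4g.40gb", "4g.40gb"]))             # False: 8 > 7
```

A common serving pattern is seven 1g.10gb instances, each hosting an independent small-model replica with hard memory and fault isolation.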

Model Capacity on A100

| Model | Parameters | VRAM (weights, at listed precision) | A100 40GB | A100 80GB |
| --- | --- | --- | --- | --- |
| Mistral 7B | 7B | 14 GB | Yes | Yes |
| Llama 3.1 8B | 8B | 16 GB | Yes | Yes |
| Llama 2 13B | 13B | 26 GB | Yes (tight) | Yes |
| Mixtral 8x7B (INT8) | 47B | 24 GB | Yes (tight) | Yes |
| Llama 2 70B (INT4) | 70B | 35 GB | Yes (tight) | Yes |
| Llama 2 70B (FP16) | 70B | 140 GB | No | No (2 GPUs) |
| DeepSeek V3 (INT4) | 671B | ~170 GB | No | No (4+ GPUs) |

A100 80GB handles models up to roughly 30B parameters in FP16 for training (with optimizer states). For inference, quantized 70B models fit on a single A100 80GB. The sweet spot is 7B to 30B parameter models where the A100 provides ample memory without requiring multi-GPU sharding. Check our best NVIDIA GPUs for LLMs guide to see if A100 matches your specific model requirements.
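The training limit is driven by optimizer state, not just weights. A common rule of thumb for mixed-precision Adam is ~16 bytes per parameter (2 B FP16 weights + 2 B gradients + 12 B FP32 master weights, momentum, and variance), before activations. A rough sketch of what that implies for node sizing under FSDP sharding:

```python
def adam_training_gb(params_b: float, bytes_per_param: int = 16) -> float:
    """Rough state footprint for mixed-precision Adam training:
    2 B FP16 weights + 2 B grads + 12 B FP32 master/momentum/variance,
    excluding activations and allocator fragmentation."""
    return params_b * bytes_per_param

for p in (7, 13, 30):
    need = adam_training_gb(p)
    gpus = max(1, -(-need // 80))  # ceiling division by 80 GB per card
    print(f"{p}B params: ~{need:.0f} GB state -> {gpus:.0f}x A100 80GB minimum")
```

By this estimate a 30B model needs ~480 GB of sharded state, i.e. most of an 8x A100 80GB node once activations are included, which is why 30B is about the practical ceiling for full-parameter training on A100 hardware.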

Choosing an A100 Configuration

A100 comes in four configurations you will actually encounter on GPU clouds: 40GB or 80GB memory, SXM4 or PCIe form factor. The right pick depends on workload shape and blast radius.

A100 80GB SXM4 is the default for multi-GPU training, large-batch fine-tuning, and anything that needs NVLink: up to 600 GB/s between GPUs, 400W per card, 2.0 TB/s memory bandwidth. This is what ships in DGX A100 and HGX A100 reference designs.

A100 80GB PCIe is the right call for single-GPU inference, data processing, or mixing GPUs into standard servers without the thermal and power budget of SXM. 300W, no NVLink, same compute and memory subsystem.

A100 40GB (SXM or PCIe) still exists but is rarely the right pick in 2026. The 80GB variant has the same compute with 2x the memory and ~30 percent more bandwidth. The only reason to choose 40GB is if it is meaningfully cheaper per hour on a workload that genuinely fits.

For current per-hour pricing across these configurations, see the A100 rental page.

Understanding SXM vs PCIe

This distinction matters more than most providers explain.

A100 SXM offers higher memory bandwidth (2,039 GB/s) and NVLink support (600 GB/s), which improves performance for multi-GPU training and memory-intensive workloads. PCIe A100 trades some of that performance for lower power draw (300W vs 400W) and easier deployment in standard server chassis.

If your workload involves distributed training, large batch sizes, or heavy GPU-to-GPU communication, SXM is the better choice. If you focus on single-GPU inference or data processing, PCIe often delivers better cost efficiency.

Spheron exposes this difference clearly so teams can choose intentionally.

Spot vs Dedicated: Picking the Right Tier

On Spheron, on-demand A100 comes in two tiers: dedicated (99.99% SLA, non-interruptible) and spot (cheaper, interruptible). Spot is cheaper because it gets reclaimed when demand spikes. Dedicated is more expensive because it does not. The choice is workload-shaped.

Use spot for: training jobs with frequent checkpointing (every 15 to 30 minutes), batch inference, data preprocessing, hyperparameter sweeps, anything you can safely resume. The cost savings are large enough that even losing an hour of progress once a day usually still comes out ahead.
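The checkpointing pattern that makes spot safe is simple: on startup, restore the most recent checkpoint if one exists; during training, periodically write state with a write-then-rename so a reclaim mid-write never corrupts the file. A minimal framework-free sketch (a real job would pickle model and optimizer state rather than a counter):

```python
import os
import pickle
import tempfile

def train_with_checkpoints(total_steps: int, ckpt_path: str,
                           save_every: int = 100) -> int:
    """Resumable loop: restore the last saved state if present, then
    checkpoint periodically so a spot reclaim only loses recent steps."""
    state = {"step": 0, "loss_sum": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]   # stand-in for real work
        if state["step"] % save_every == 0:
            tmp = ckpt_path + ".tmp"               # write-then-rename: atomic
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["step"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
print(train_with_checkpoints(250, path))  # a reclaim here loses <= 100 steps
```

If the process is killed and restarted with the same path, it resumes from the last multiple of `save_every`, so the maximum loss per interruption is one checkpoint interval of compute.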

Use dedicated for: production serving, jobs where re-running costs more than the instance, anything with a deadline. The predictability is worth the premium.

A practical pattern that works well: run most of your training on spot, keep a small dedicated tier as a failover for the last 10 to 20 percent of any long run where rerun cost would exceed the spot savings.

Common Use Cases

Training and fine-tuning: A100 handles models in the 7B to 30B parameter range comfortably. It supports mixed-precision training (TF32, FP16, BF16) and FSDP for larger models. Teams use it for continued pre-training, supervised fine-tuning, and LoRA/QLoRA.
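LoRA's memory appeal is easy to quantify: each adapted weight matrix gains two low-rank factors, A (d x r) and B (r x d), and only those train. For a hypothetical 7B-style config (32 layers, d_model 4096, rank 16, adapting the four square attention projections), the trainable fraction works out as:

```python
def lora_trainable_params(d_model: int, n_layers: int, rank: int,
                          matrices_per_layer: int = 4) -> int:
    """Trainable LoRA params: each adapted d x d weight gains factors
    A (d x r) and B (r x d). Assumes square attention projections only."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

trainable = lora_trainable_params(4096, 32, 16)
print(f"{trainable / 1e6:.1f}M trainable ({trainable / 7e9:.2%} of 7B)")
```

About 16.8M trainable parameters, under a quarter of one percent of the model. Since optimizer state is only kept for those, a fine-tune that would blow past 80 GB as a full-parameter run fits comfortably on one A100.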

Production inference: A100 delivers stable latency and high throughput for production serving. MIG allows teams to isolate workloads cleanly, which improves utilization and reduces operational complexity.

Data analytics: RAPIDS, GPU-accelerated SQL engines, and cuDF benefit directly from A100's memory bandwidth and CUDA ecosystem. Teams running data preprocessing pipelines alongside training see significant speedups.

Research and experimentation: Startups and researchers use spot A100 instances to experiment quickly without committing to expensive long-term infrastructure. The mature ecosystem means most papers and codebases "just work" on A100.

When A100 Is the Right Choice

A100 makes sense when you want reliable AI compute at a reasonable cost. It fits teams that value stability, ecosystem maturity, and predictable performance. Power draw is lower than H100-class GPUs (400W vs 700W), which keeps power and cooling costs, and therefore total cost of ownership, in check. See our GPU cost optimization playbook for practical tactics.

If your workloads are hitting memory bandwidth limits or need models beyond 70B in FP16, H100 or H200 may be a better fit. Otherwise, A100 remains one of the smartest choices in the GPU market. Compare options in our renting GPUs guide and the best NVIDIA GPUs for LLMs framework.


Ready to deploy? Check live per-hour A100 pricing and configurations on the A100 rental page or see all GPU pricing.

Rent A100 on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.