If you are picking an A100 configuration for a real workload, the hard decisions are not "do I need an A100." They are SXM or PCIe, spot or dedicated, MIG or whole-GPU, and how many cards per node. This guide walks through each of those trade-offs with the numbers and failure modes we see most often in production.
For live A100 pricing and deployment, see the A100 rental page. For a broader architectural comparison, start with our A100 vs V100 analysis or the best NVIDIA GPUs for LLMs framework.
Why the A100 is Still the Default in 2026
The A100 was NVIDIA's first GPU designed specifically for large-scale AI, not just graphics or HPC with AI added on top. Built on the Ampere architecture, it introduced features that reshaped how AI workloads run in production: high-bandwidth HBM2e memory, third-generation Tensor Cores, and Multi-Instance GPU support.
Most importantly, the software ecosystem around A100 is mature. Frameworks like PyTorch, TensorFlow, JAX, TensorRT, and RAPIDS have been tuned for years on this hardware. Engineers know how A100 behaves under sustained load. That reliability still matters more than raw benchmarks.
For many teams, A100 is the most cost-effective way to run serious AI workloads without paying a premium for bleeding-edge hardware they may not fully utilize.
A100 Technical Specifications
| Specification | A100 80GB SXM | A100 40GB SXM | A100 80GB PCIe |
|---|---|---|---|
| Architecture | Ampere (7nm) | Ampere (7nm) | Ampere (7nm) |
| CUDA Cores | 6,912 | 6,912 | 6,912 |
| Tensor Cores | 432 (3rd Gen) | 432 (3rd Gen) | 432 (3rd Gen) |
| VRAM | 80 GB HBM2e | 40 GB HBM2e | 80 GB HBM2e |
| Memory Bandwidth | 2,039 GB/s | 1,555 GB/s | 2,039 GB/s |
| FP32 (TFLOPS) | 19.5 | 19.5 | 19.5 |
| TF32 Tensor (TFLOPS) | 156 | 156 | 156 |
| FP16 Tensor (TFLOPS) | 312 | 312 | 312 |
| FP16 Sparsity (TFLOPS) | 624 | 624 | 624 |
| INT8 Tensor (TOPS) | 624 | 624 | 624 |
| NVLink Bandwidth | 600 GB/s | 600 GB/s | N/A |
| PCIe | Gen 4 | Gen 4 | Gen 4 |
| MIG Instances | Up to 7 | Up to 7 | Up to 7 |
| TDP | 400W | 400W | 300W |
The A100 delivers 312 TFLOPS of FP16 Tensor performance, with up to 624 TFLOPS when structural sparsity is enabled. TF32 is particularly important: it lets existing FP32 models run on Tensor Cores without code changes, delivering major speedups while preserving accuracy in practice. This is one of the reasons A100 saw rapid adoption across existing AI codebases.
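The headline speedups follow directly from the spec table above. A quick back-of-envelope check (pure arithmetic, no GPU required; these are theoretical peaks, real kernels land well below them):

```python
# Peak A100 throughput from the spec table above (dense, no sparsity), in TFLOPS.
PEAK_TFLOPS = {
    "FP32": 19.5,
    "TF32": 156.0,   # Tensor Core TF32
    "FP16": 312.0,   # Tensor Core FP16
}

def theoretical_speedup(from_fmt: str, to_fmt: str) -> float:
    """Upper-bound speedup from switching numeric formats on A100."""
    return PEAK_TFLOPS[to_fmt] / PEAK_TFLOPS[from_fmt]

print(theoretical_speedup("FP32", "TF32"))  # 8.0 -- why TF32 "just works" pays off
print(theoretical_speedup("FP32", "FP16"))  # 16.0
```

Achieved speedups depend on how much of a workload is actually matrix math, but the 8x TF32 ceiling explains why opting in costs nothing and frequently helps.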
Key Architecture Features
Ampere Tensor Cores
At the heart of A100 are NVIDIA's third-generation Tensor Cores, which accelerate matrix operations that dominate AI training and inference. A100 supports multiple numerical formats: FP32, TF32, FP16, BF16, INT8, and INT4, allowing the same GPU to handle training, fine-tuning, and inference efficiently.
HBM2e Memory and Bandwidth
The A100 80GB uses HBM2e memory delivering nearly 2 TB/s of bandwidth. Many AI workloads are memory-bound, not compute-bound; large models, large batch sizes, and long context windows all benefit directly from higher memory capacity and bandwidth. With A100 80GB, more of the model and data stays resident on the GPU instead of spilling into system memory.
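A rough roofline check makes the memory-bound point concrete. Using the 80GB spec numbers (312 TFLOPS FP16, 2,039 GB/s), a kernel must perform roughly 150 FLOPs per byte moved before compute, rather than memory, becomes the bottleneck; a sketch of the arithmetic:

```python
PEAK_FP16_FLOPS = 312e12    # A100 80GB FP16 Tensor Core peak, FLOPs/s
MEM_BANDWIDTH = 2039e9      # HBM2e bandwidth, bytes/s

# Arithmetic intensity (FLOPs per byte) at which the A100 stops being
# memory-bound: kernels below this line are limited by bandwidth.
break_even_ai = PEAK_FP16_FLOPS / MEM_BANDWIDTH
print(round(break_even_ai))  # ~153 FLOPs per byte

# Example: the matrix-vector multiplies that dominate single-stream LLM
# decoding have an arithmetic intensity near 1 FLOP per byte -- deeply
# memory-bound, which is why token generation speed tracks bandwidth,
# not TFLOPS.
```

This is why the 80GB variant's extra bandwidth shows up directly in inference throughput even though its compute is identical to the 40GB card.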
NVLink Interconnect
A100 SXM variants support NVLink, NVIDIA's high-speed GPU-to-GPU interconnect. NVLink allows multiple GPUs to communicate at up to 600 GB/s, far faster than PCIe alone. This is critical for multi-GPU training, model parallelism, and large inference clusters. When GPUs can exchange gradients or activations quickly, scaling efficiency improves and training time drops.
Multi-Instance GPU (MIG)
MIG allows a single A100 GPU to be split into up to seven isolated GPU instances, each with its own memory, compute, and bandwidth allocation. These instances behave like independent GPUs. This is extremely useful for inference workloads, shared environments, and teams running multiple smaller jobs; instead of underutilizing a full GPU, MIG lets teams pack workloads efficiently while maintaining isolation.
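To make the packing idea concrete, here is an illustrative sketch using the standard MIG profile names for an A100 80GB (the `<slices>g.<mem>gb` naming follows nvidia-smi; the fit check below is a simplified first-pass filter, since real MIG placement has additional layout constraints this ignores):

```python
# Standard MIG profiles on an A100 80GB: name -> (compute slices, GiB).
MIG_PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def fits_on_one_a100(requested: list) -> bool:
    """Simplified check: do these instances fit within the GPU's
    7 compute slices and 80 GiB of memory? Real placement has extra
    layout rules; treat this as a first-pass capacity filter."""
    slices = sum(MIG_PROFILES[p][0] for p in requested)
    mem = sum(MIG_PROFILES[p][1] for p in requested)
    return slices <= 7 and mem <= 80

print(fits_on_one_a100(["1g.10gb"] * 7))                    # True: max partitioning
print(fits_on_one_a100(["3g.40gb", "4g.40gb"]))             # True: 7 slices, 80 GiB
print(fits_on_one_a100(["3g.40gb", "3g.40gb", "2g.20gb"]))  # False: 8 slices
```

In practice, a team might run seven 1g.10gb instances for small inference endpoints, or split the card into a 4g.40gb instance for a latency-sensitive service plus a 3g.40gb instance for batch work.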
Model Capacity on A100
| Model | Parameters | VRAM (FP16) | A100 40GB | A100 80GB |
|---|---|---|---|---|
| Mistral 7B | 7B | 14 GB | Yes | Yes |
| Llama 3.1 8B | 8B | 16 GB | Yes | Yes |
| Llama 2 13B | 13B | 26 GB | Yes (tight) | Yes |
| Mixtral 8x7B (INT4) | 47B | 24 GB | Yes (tight) | Yes |
| Llama 2 70B (INT4) | 70B | 35 GB | Yes (tight) | Yes |
| Llama 2 70B (FP16) | 70B | 140 GB | No | No (2 GPU) |
| DeepSeek V3 (INT4) | 671B | ~340 GB | No | No (multi-GPU) |
A100 80GB handles models up to roughly 30B parameters in FP16 for training (with optimizer states). For inference, quantized 70B models fit on a single A100 80GB. The sweet spot is 7B to 30B parameter models where the A100 provides ample memory without requiring multi-GPU sharding. Check our best NVIDIA GPUs for LLMs guide to see if A100 matches your specific model requirements.
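The FP16 column in the table is just parameter count times bytes per parameter; a rule-of-thumb estimator (weights only, so KV cache, activations, and framework overhead come on top, often another 10 to 30 percent for inference):

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weights_vram_gb(params_billion: float, fmt: str) -> float:
    """Weight memory only. Budget extra headroom for KV cache,
    activations, and runtime overhead before picking a card."""
    return params_billion * BYTES_PER_PARAM[fmt]

print(weights_vram_gb(7, "FP16"))    # 14.0 -> fits a 40GB card easily
print(weights_vram_gb(70, "INT4"))   # 35.0 -> single 80GB; tight on 40GB
print(weights_vram_gb(70, "FP16"))   # 140.0 -> needs 2x A100 80GB
```

Running the numbers before provisioning avoids the most common failure mode we see: a model that technically loads but OOMs the moment batch size or context length grows.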
Choosing an A100 Configuration
A100 comes in four configurations you will actually encounter on GPU clouds: 40GB or 80GB memory, SXM4 or PCIe form factor. The right pick depends on workload shape and blast radius.
A100 80GB SXM4 is the default for multi-GPU training, large-batch fine-tuning, and anything that needs NVLink. It provides up to 600 GB/s between GPUs, draws 400W per card, and delivers 2.0 TB/s of memory bandwidth. This is what ships in DGX A100 and HGX A100 reference designs.
A100 80GB PCIe is the right call for single-GPU inference, data processing, or mixing GPUs into standard servers without the thermal and power budget of SXM. 300W, no NVLink, same compute and memory subsystem.
A100 40GB (SXM or PCIe) still exists but is rarely the right pick in 2026. The 80GB variant has the same compute with 2x the memory and ~30 percent more bandwidth. The only reason to choose 40GB is if it is meaningfully cheaper per hour on a workload that genuinely fits.
For current per-hour pricing across these configurations, see the A100 rental page.
Understanding SXM vs PCIe
This distinction matters more than most providers explain.
A100 SXM offers higher memory bandwidth (2,039 GB/s) and NVLink support (600 GB/s), which improves performance for multi-GPU training and memory-intensive workloads. PCIe A100 trades some of that performance for lower power draw (300W vs 400W) and easier deployment in standard server chassis.
If your workload involves distributed training, large batch sizes, or heavy GPU-to-GPU communication, SXM is the better choice. If you focus on single-GPU inference or data processing, PCIe often delivers better cost efficiency.
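A rough estimate of gradient-sync time shows why the interconnect dominates this decision. The sketch below assumes a ring all-reduce (each GPU moves about 2(N-1)/N times the payload over its link) and an effective PCIe Gen4 x16 rate of roughly 25 GB/s; both the model size and the effective rates are illustrative assumptions, not measurements:

```python
def allreduce_seconds(payload_gb: float, link_gb_s: float, n_gpus: int = 8) -> float:
    """Lower-bound time for a ring all-reduce: each GPU sends and
    receives ~2*(N-1)/N times the payload over its link."""
    return payload_gb * 2 * (n_gpus - 1) / n_gpus / link_gb_s

GRADS_GB = 26.0   # illustrative: FP16 gradients of a 13B-parameter model

print(allreduce_seconds(GRADS_GB, 600.0))  # NVLink (SXM): under 0.1 s
print(allreduce_seconds(GRADS_GB, 25.0))   # PCIe Gen4, ~25 GB/s effective: ~1.8 s
```

If that sync happens every training step, the PCIe cluster spends most of its wall-clock time waiting on communication, which is exactly the failure mode that makes SXM worth its premium for distributed training.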
Spheron exposes this difference clearly so teams can choose intentionally.
Spot vs Dedicated: Picking the Right Tier
On Spheron, on-demand A100 comes in two tiers: dedicated (99.99% SLA, non-interruptible) and spot (cheaper, interruptible). Spot is cheaper because it gets reclaimed when demand spikes. Dedicated is more expensive because it does not. The choice is workload-shaped.
Use spot for: training jobs with frequent checkpointing (every 15 to 30 minutes), batch inference, data preprocessing, hyperparameter sweeps, anything you can safely resume. The cost savings are large enough that even losing an hour of progress once a day usually still comes out ahead.
Use dedicated for: production serving, jobs where re-running costs more than the instance, anything with a deadline. The predictability is worth the premium.
A practical pattern that works well: run most of your training on spot, keep a small dedicated tier as a failover for the last 10 to 20 percent of any long run where rerun cost would exceed the spot savings.
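The spot-vs-dedicated break-even is easy to model. The sketch below assumes each reclaim loses, on average, half a checkpoint interval of progress; the per-hour prices are hypothetical placeholders, so check the A100 rental page for live numbers:

```python
def effective_cost_per_useful_hour(price_per_hour: float,
                                   interruptions_per_day: float,
                                   checkpoint_minutes: float) -> float:
    """Expected price per hour of *useful* compute: each interruption
    loses on average half a checkpoint interval of progress."""
    lost_hours = interruptions_per_day * (checkpoint_minutes / 2) / 60
    useful_fraction = (24 - lost_hours) / 24
    return price_per_hour / useful_fraction

# Hypothetical prices for illustration only.
dedicated = effective_cost_per_useful_hour(2.00, 0, 30)
spot = effective_cost_per_useful_hour(1.20, 2, 30)  # 2 reclaims/day, 30-min ckpts
print(round(dedicated, 2), round(spot, 2))  # 2.0 1.23 -- spot wins comfortably
```

The model also shows why checkpoint frequency matters: widen the interval to several hours, or raise the reclaim rate, and spot's effective price climbs toward dedicated.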
Common Use Cases
Training and fine-tuning: A100 handles models in the 7B to 30B parameter range comfortably. It supports mixed-precision training (TF32, FP16, BF16) and FSDP for larger models. Teams use it for continued pre-training, supervised fine-tuning, and LoRA/QLoRA.
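The memory arithmetic behind "FSDP for larger models" and the popularity of LoRA is worth seeing once. With standard mixed-precision AdamW, each parameter costs about 16 bytes of state before activations; the LoRA trainable fraction below is an illustrative assumption:

```python
def full_finetune_gb(params_billion: float) -> float:
    """Mixed-precision AdamW state per parameter:
    2 (FP16 weights) + 2 (FP16 grads) + 4 (FP32 master copy)
    + 8 (FP32 Adam moments) = 16 bytes. Activations come on top."""
    return params_billion * 16

def lora_finetune_gb(params_billion: float,
                     trainable_fraction: float = 0.01) -> float:
    """LoRA: frozen FP16 base weights, plus full optimizer state only
    for the small adapter (fraction here is illustrative)."""
    return params_billion * 2 + params_billion * trainable_fraction * 16

print(full_finetune_gb(7))   # 112 GB -> shard across A100s with FSDP
print(lora_finetune_gb(7))   # ~15 GB -> fits one A100 comfortably
```

This is why even a 7B full fine-tune wants sharding across multiple 80GB cards, while LoRA on the same model runs happily on a single A100 or even a MIG slice.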
Production inference: A100 delivers stable latency and high throughput for production serving. MIG allows teams to isolate workloads cleanly, which improves utilization and reduces operational complexity.
Data analytics: RAPIDS, GPU-accelerated SQL engines, and cuDF benefit directly from A100's memory bandwidth and CUDA ecosystem. Teams running data preprocessing pipelines alongside training see significant speedups.
Research and experimentation: Startups and researchers use spot A100 instances to experiment quickly without committing to expensive long-term infrastructure. The mature ecosystem means most papers and codebases "just work" on A100.
When A100 Is the Right Choice
A100 makes sense when you want reliable AI compute at a reasonable cost. It fits teams that value stability, ecosystem maturity, and predictable performance. Power draw is lower than H100-class GPUs (400W vs 700W), which keeps power and cooling costs, and therefore total cost of ownership, down. See our GPU cost optimization playbook for practical tactics.
If your workloads are hitting memory bandwidth limits or need models beyond 70B in FP16, H100 or H200 may be a better fit. Otherwise, A100 remains one of the smartest choices in the GPU market. Compare options in our renting GPUs guide and the best NVIDIA GPUs for LLMs framework.
Ready to deploy? Check live per-hour A100 pricing and configurations on the A100 rental page or see all GPU pricing.
