NVIDIA L40S vs A100: Inference Throughput, VRAM, and Cost-Per-Token (2026)

Written by Mitrasish, Co-founder · May 4, 2026
Choosing between L40S and A100 for LLM inference comes down to one question: does FP8's compute advantage on Ada Lovelace overcome the A100's superior memory bandwidth? The answer depends on your batch size, context length, and whether you're comparing on-demand to on-demand or factoring in spot availability.

This post works through the specs, vLLM benchmarks for Llama 3.1 70B and Qwen3 32B, the cost-per-million-token math using live Spheron pricing, and a migration playbook if you decide to move from A100 to L40S.

TL;DR: L40S vs A100 at a Glance

| Metric | L40S | A100 40GB | A100 80GB |
|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | Ampere |
| VRAM | 48GB GDDR6 | 40GB HBM2e | 80GB HBM2e |
| Memory Bandwidth | 864 GB/s | 1,555 GB/s | 2,039 GB/s |
| FP8 Tensor Cores | Yes (4th gen) | No | No |
| BF16 TFLOPS | 362 | 312 | 312 |
| FP8 TFLOPS | 733 | N/A | N/A |
| NVLink | None (PCIe only) | NVLink 3.0 | NVLink 3.0 |
| Interconnect | PCIe 4.0 | PCIe 4.0 / SXM | PCIe 4.0 / SXM |
| TDP | 350W | 300W | 400W |
| Best for | Batch inference (FP8) | Fine-tuning, NLP | Long-context, training |

The key tradeoff: L40S wins inference cost-per-token at batch 8+ because FP8 delivers 733 TFLOPS (2.35x A100's BF16), with 1-byte weights that halve memory pressure. A100 wins on memory-bandwidth-bound workloads (long context beyond 32K tokens, low-batch single-stream serving) and multi-GPU training via NVLink.

Architecture: Ada Lovelace vs Ampere

FP8 Tensor Cores

The most consequential hardware difference is FP8 support. L40S ships with 4th-generation Tensor Cores that execute FP8 (E4M3 and E5M2) natively. A100 uses 3rd-generation Tensor Cores with FP16, BF16, INT8, and INT4 support only. No FP8.

In practical terms: loading a 70B model in FP8 requires 70GB (1 byte per parameter), which fits on two 48GB L40S cards. The same model in BF16 is 140GB, which needs two 80GB A100s. FP8 on L40S delivers 733 TFLOPS vs A100's 312 BF16 TFLOPS. At compute-dominated batch sizes (batch 8+), that 2.35x compute ratio translates directly into higher throughput.
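
The sizing arithmetic above can be sketched in a few lines. This is a back-of-the-envelope illustration only; real deployments add a few GB per GPU for activations, CUDA context, and framework overhead:

```python
# Rough weight-footprint sizing: 1 GB per billion params per byte of dtype.
import math

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2, "fp16": 2}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Model weight footprint in GB."""
    return params_billion * BYTES_PER_PARAM[dtype]

def min_gpus(params_billion: float, dtype: str, vram_gb: float) -> int:
    """Minimum card count to hold the weights alone (no KV-cache headroom)."""
    return math.ceil(weight_gb(params_billion, dtype) / vram_gb)

print(weight_gb(70, "fp8"))      # 70.0 GB -> two 48GB L40S
print(min_gpus(70, "fp8", 48))   # 2
print(weight_gb(70, "bf16"))     # 140.0 GB -> two 80GB A100
print(min_gpus(70, "bf16", 80))  # 2
```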

To activate FP8 on L40S, use --quantization fp8 in vLLM. Dynamic FP8 quantization at load time typically shows 3-5% accuracy degradation on most instruction-tuned models. Production-tested FP8 checkpoints from Hugging Face (e.g., meta-llama/Llama-3.1-70B-Instruct-FP8) have pre-calibrated scales and show less degradation.

Memory Type: GDDR6 vs HBM2e

L40S uses GDDR6 at 864 GB/s. A100 80GB uses HBM2e at 2,039 GB/s. That 2.35x bandwidth difference is the primary reason A100 wins at low batch sizes where inference is memory-bandwidth-bound.

In the decode phase, every generated token requires reading the entire model weight tensor once. For a 70B FP8 model (~70GB), the L40S takes about 81ms to load all weights per step (70GB / 864 GB/s). The A100 in BF16 reads 140GB of weights, which takes about 69ms (140GB / 2,039 GB/s). At batch 1, the GPU spends most of its time doing these weight reads, and A100's bandwidth advantage is real though smaller than a naive byte-count comparison suggests. At batch 8+, the time is dominated by compute (the weight matrix multiply), and L40S's FP8 TFLOPS advantage takes over.
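
As a sanity check, the per-step figures above fall out of a one-line division. This is an upper bound on the bandwidth-bound decode rate; real kernels overlap weight reads with compute:

```python
# Time to stream all model weights once per decode step, in milliseconds.
def weight_read_ms(weight_gb: float, bandwidth_gbps: float) -> float:
    return weight_gb / bandwidth_gbps * 1000

print(f"L40S FP8:  {weight_read_ms(70, 864):.0f} ms/step")    # ~81 ms
print(f"A100 BF16: {weight_read_ms(140, 2039):.0f} ms/step")  # ~69 ms
```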

NVLink and Multi-GPU Scaling

A100 SXM4 uses NVLink 3.0 at 600 GB/s total bidirectional bandwidth. L40S has no NVLink: multi-card L40S configurations communicate over PCIe 4.0, at roughly 64 GB/s bidirectional per x16 slot. For tensor-parallel inference (TP=2), PCIe is sufficient because the all-reduce operations between two cards are small relative to the computation. For training all-reduce on large models, PCIe is a real bottleneck compared to NVLink.

The practical impact: if you're running inference only (not training), the NVLink absence on L40S is a minor issue for TP=2. For TP=4 or TP=8 inference at scale, NVLink-connected A100 configurations will maintain lower inter-GPU latency.
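
To see why PCIe is fine for TP=2 inference but painful for training, a rough ring all-reduce estimate helps. This sketch ignores link latency, protocol overhead, and NCCL's algorithm selection, and uses the nominal link speeds quoted above:

```python
# Rough ring all-reduce transfer time over the slowest link.
def allreduce_ms(tensor_gb: float, n_gpus: int, link_gbps: float) -> float:
    # A ring all-reduce moves 2*(N-1)/N of the tensor across each link.
    return 2 * (n_gpus - 1) / n_gpus * tensor_gb / link_gbps * 1000

# TP=2 decode: per-token activations are tiny (~16KB), so PCIe barely registers.
print(allreduce_ms(16e-6, 2, 64))   # fractions of a microsecond

# Training: syncing a 1GB gradient bucket across 2 GPUs.
print(allreduce_ms(1.0, 2, 64))     # ~15.6 ms over PCIe 4.0 (~64 GB/s)
print(allreduce_ms(1.0, 2, 600))    # ~1.7 ms over NVLink 3.0 (~600 GB/s)
```

The roughly 9-10x gap between the last two numbers is the training-side penalty the paragraph above describes.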

NVENC and Display

L40S includes NVENC (hardware video encoding) and a display engine. A100 omits both. This is relevant only for workloads mixing inference with video transcoding or rendering. For pure LLM serving, both GPUs ignore these blocks entirely.

Full Specifications

| Spec | L40S | A100 SXM4 (40GB) | A100 SXM4 (80GB) | A100 PCIe (80GB) |
|---|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | Ampere | Ampere |
| CUDA Cores | 18,176 | 6,912 | 6,912 | 6,912 |
| VRAM | 48GB GDDR6 | 40GB HBM2e | 80GB HBM2e | 80GB HBM2e |
| Memory BW | 864 GB/s | 1,555 GB/s | 2,039 GB/s | 1,935 GB/s |
| FP8 TFLOPS | 733 | N/A | N/A | N/A |
| FP16 TFLOPS | 362 | 312 | 312 | 312 |
| BF16 TFLOPS | 362 | 312 | 312 | 312 |
| INT8 TOPS | 733 | 624 | 624 | 624 |
| TDP | 350W | 400W | 400W | 300W |
| NVLink BW | N/A | 600 GB/s | 600 GB/s | N/A |
| PCIe | Gen 4.0 | Gen 4.0 | Gen 4.0 | Gen 4.0 |
| ECC | Yes | Yes | Yes | Yes |
| NVENC | Yes | No | No | No |

The A100 PCIe 80GB has slightly lower memory bandwidth (1,935 GB/s) than the SXM4 variant (2,039 GB/s) due to the different board design. For inference without NVLink, A100 PCIe is a close comparison point to L40S, and the bandwidth ratio drops to 2.24x (still significant).

Inference Benchmarks on vLLM

The benchmarks below are approximate community estimates based on vLLM 0.4.x, with tensor-parallel-size 2 for 70B models, 512-token inputs, and 128-token outputs. Numbers vary by vLLM version, driver, and server configuration; treat them as estimates rather than original measurements.

Llama 3.1 70B: 2x L40S FP8 vs 2x A100 80GB BF16

This comparison requires 2 GPUs per configuration. L40S FP8 uses --quantization fp8 with sm89 support (CUDA 11.8+, Ada Lovelace). A100 uses BF16 (no FP8 support on Ampere).

| GPU Config | Quantization | Batch Size | Throughput (tok/s) | TTFT p50 (ms) |
|---|---|---|---|---|
| 2x L40S | FP8 | 1 | ~90 | ~900 |
| 2x L40S | FP8 | 8 | ~800 | ~1,000 |
| 2x L40S | FP8 | 32 | ~2,600 | ~1,500 |
| 2x A100 80GB SXM4 | BF16 | 1 | ~170 | ~580 |
| 2x A100 80GB SXM4 | BF16 | 8 | ~500 | ~720 |
| 2x A100 80GB SXM4 | BF16 | 32 | ~1,400 | ~1,000 |

Approximate community benchmarks. Actual throughput varies by vLLM version (0.4+), CUDA driver, and hardware configuration. Reference: vLLM GitHub benchmarking discussions and community reports.

At batch 1 (single-stream serving), A100 BF16 leads by roughly 1.9x due to its higher memory bandwidth. Every generated token requires streaming the model weights and KV cache from memory, and the A100's 2,039 GB/s does this faster than the L40S's 864 GB/s, even after accounting for FP8's smaller memory footprint.

The crossover happens around batch 4-8. By batch 8, FP8 compute on the L40S (~800 tok/s) overtakes A100 BF16 (~500 tok/s) by about 1.6x. At batch 32, the L40S FP8 advantage grows to roughly 1.85x as the workload becomes increasingly compute-bound.

A critical constraint: FP8 on L40S requires CUDA sm89 (Ada Lovelace architecture) and a driver supporting CUDA 11.8+. A100 (sm80) does not support FP8. If you launch vLLM with --quantization fp8 on an A100, it falls back to INT8 or BF16 depending on the vLLM version.
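
A quick pre-flight check saves a failed deployment. The lookup below is a hand-maintained illustration (values are NVIDIA's published compute capabilities for these parts); on a live node you can usually read the same value from `nvidia-smi --query-gpu=compute_cap --format=csv` or `torch.cuda.get_device_capability()`:

```python
# Hand-maintained compute-capability lookup (illustrative subset).
SM_BY_GPU = {"A100": 80, "L40S": 89, "H100": 90}

def supports_native_fp8(gpu: str) -> bool:
    # Native FP8 tensor cores arrived with sm89 (Ada Lovelace).
    return SM_BY_GPU[gpu] >= 89

print(supports_native_fp8("L40S"))  # True
print(supports_native_fp8("A100"))  # False
```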

Qwen3 32B: Single L40S vs Single A100 80GB

For a 32B model, FP8 weights are 32GB, which fits on a single 48GB L40S. BF16 weights are 64GB, which fits on a single 80GB A100. This is a meaningful difference: you can run 32B inference on one L40S where you need a full 80GB A100 for BF16.

| GPU | Quantization | Batch Size | Throughput (tok/s) | TTFT p50 (ms) |
|---|---|---|---|---|
| L40S | FP8 | 1 | ~220 | ~320 |
| L40S | FP8 | 8 | ~1,500 | ~380 |
| A100 80GB | BF16 | 1 | ~300 | ~220 |
| A100 80GB | BF16 | 8 | ~870 | ~310 |

Approximate estimates. A100 80GB BF16 32B benchmarks based on community vLLM reports. L40S FP8 32B based on interpolation from published Ada Lovelace FP8 throughput ratios.

At batch 1, A100 leads due to its higher memory bandwidth. The 64GB BF16 weight tensor reads faster over HBM2e (2,039 GB/s) than the 32GB FP8 tensor over GDDR6 (864 GB/s), since the bandwidth ratio (2.35x) exceeds the FP8 size advantage (2x). TTFT is noticeably lower on A100 for single-stream interactive serving.
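
The arithmetic behind that batch-1 result, using the spec-sheet numbers from earlier:

```python
# Per-decode-step weight-read time: A100 streams 64GB of BF16 weights
# faster than L40S streams 32GB of FP8, because 2,039/864 > 2.
print(f"A100 BF16: {64 / 2039 * 1000:.0f} ms")  # ~31 ms
print(f"L40S FP8:  {32 / 864 * 1000:.0f} ms")   # ~37 ms
```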

At batch 8, the compute-to-bandwidth ratio shifts. L40S FP8's 733 TFLOPS processes the larger attention matrix multiply faster than A100's 312 BF16 TFLOPS, flipping the result. L40S FP8 pulls ahead at batch 8, though the margin is narrower than for 70B models.

Note: these figures are approximate. Actual results depend on sequence lengths, vLLM version, and driver configuration. Run your own benchmarks before migrating production workloads.

Memory and Batch Sizing

Bandwidth-Bound vs Compute-Bound Inference

LLM inference has two regimes. In the decode phase (generating one token at a time), the GPU must read all model weights once per token. When batch size is low (1-4), this weight read dominates total execution time and throughput scales with memory bandwidth. In this regime, A100 80GB's 2,039 GB/s gives it a clear advantage over L40S's 864 GB/s.

At larger batch sizes (8+), the GPU processes multiple requests simultaneously. The weight read is amortized across the batch, and compute throughput becomes the bottleneck. Here, L40S FP8's 733 TFLOPS significantly outperforms A100's 312 BF16 TFLOPS.

The crossover point depends on model size. For 70B with TP=2:

| Batch Size | Throughput Bottleneck | L40S FP8 vs A100 BF16 |
|---|---|---|
| 1 | Memory bandwidth (weight reads) | A100 wins (~1.9x faster) |
| 4 | Mixed | A100 slightly ahead |
| 8 | Mixed-to-compute | L40S slightly ahead |
| 16 | Compute | L40S ahead (~1.7x) |
| 32 | Compute | L40S ahead (~1.85x) |

Long-Context Inference and KV Cache

At context lengths above 4K, KV cache size grows and becomes a significant portion of the VRAM budget. The KV cache per request formula:

```
kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
```

For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dim) at 32K context in BF16: approximately 10.0GB per request. In FP8 KV cache, this drops to 5.0GB per request.

| Config | Model Weights | KV Cache per Request (32K) | Concurrent Requests |
|---|---|---|---|
| 2x A100 80GB (BF16 weights) | ~140GB total | ~10.0GB (BF16 KV) | ~2 |
| 2x L40S (FP8 weights) | ~70GB total | ~5.0GB (FP8 KV) | ~5 |

At long contexts, the A100's HBM bandwidth matters more: reading 10.0GB of KV cache per decode step at 2,039 GB/s takes ~4.9ms, vs L40S at 864 GB/s taking ~11.6ms for the same 10.0GB (or ~5.8ms for FP8 KV at 5.0GB). For latency-sensitive long-context serving above 16K tokens, the A100 delivers substantially lower TTFT.
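
The formula and the latency figures above, as a runnable check (GiB here means 2^30 bytes; real vLLM allocations also carry paged-attention block overhead):

```python
# Per-request KV cache size for a grouped-query-attention model.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2 covers the K and V tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, at 32K context.
bf16_32k = kv_cache_gib(80, 8, 128, 32_768, 2)  # ~10 GiB per request
fp8_32k  = kv_cache_gib(80, 8, 128, 32_768, 1)  # ~5 GiB per request
print(f"{bf16_32k:.1f} GiB BF16, {fp8_32k:.1f} GiB FP8")

# Per-decode-step read time for the BF16 cache, in ms:
print(f"A100: {bf16_32k / 2039 * 1000:.1f} ms")  # ~4.9 ms
print(f"L40S: {bf16_32k / 864 * 1000:.1f} ms")   # ~11.6 ms
```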

For workloads with context lengths consistently below 8K tokens, this bandwidth gap is a secondary concern. For RAG pipelines, legal document analysis, or code review at 32K+ context, stick with A100 80GB.

Cost-Per-Million-Tokens Analysis

Using live Spheron pricing from 04 May 2026 and approximate community benchmarks for Llama 3.1 70B at batch 8 with tensor-parallel-size 2:

```
cost_per_1M_output_tokens = (hourly_rate_per_gpu * num_gpus / throughput_tokens_per_sec) * 1_000_000 / 3600
```

| Config | GPUs | $/hr/GPU | Total $/hr | Throughput (tok/s, batch 8) | $/1M tokens |
|---|---|---|---|---|---|
| L40S FP8 (on-demand) | 2 | $0.72 | $1.44 | ~800 | ~$0.50 |
| A100 80GB SXM4 BF16 (on-demand) | 2 | $1.64 | $3.28 | ~500 | ~$1.82 |
| A100 80GB SXM4 BF16 (spot) | 2 | $0.45 | $0.90 | ~500 | ~$0.50 |

At batch 8, L40S on-demand FP8 (~$0.50/1M tokens) is about 73% cheaper than A100 SXM4 on-demand BF16 (~$1.82/1M tokens). The gap widens at batch 32 where L40S throughput grows to ~2,600 tok/s vs A100's ~1,400 tok/s:

| Config | Total $/hr | Throughput (tok/s, batch 32) | $/1M tokens |
|---|---|---|---|
| L40S FP8 (on-demand) | $1.44 | ~2,600 | ~$0.15 |
| A100 80GB SXM4 BF16 (on-demand) | $3.28 | ~1,400 | ~$0.65 |
| A100 80GB SXM4 BF16 (spot) | $0.90 | ~1,400 | ~$0.18 |

At batch 32, L40S on-demand ($0.15/1M) beats A100 on-demand ($0.65/1M) by ~77%. A100 spot ($0.18/1M) remains competitive with L40S on-demand when spot availability allows.

The practical split: if you can tolerate preemption risk, A100 spot ($0.50/1M at batch 8) ties L40S on-demand. If you need dedicated on-demand capacity, L40S FP8 beats A100 on-demand by ~3.6x at batch 8+ for Llama 3.1 70B.

Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing for live rates.

When A100 Still Wins

Long-Context Inference Past 32K Tokens

A100 80GB's 2,039 GB/s HBM bandwidth handles the KV cache reads at large contexts significantly faster than L40S's 864 GB/s. At 64K context, a single 70B request generates ~21.5GB of KV cache per step. A100 reads this in ~10.5ms; L40S needs ~25ms with BF16 KV. For interactive long-context use cases (legal review, code analysis, document summarization), the A100's TTFT advantage is felt by end users.

Multi-GPU Training over NVLink

A100 SXM4 connects to peer GPUs via NVLink 3.0 at 600 GB/s bidirectional. L40S has no NVLink. For distributed fine-tuning with tensor parallelism or ZeRO-3, the gradient synchronization overhead on L40S (PCIe at 64 GB/s) adds meaningful latency. An 8x A100 NVLink cluster runs all-reduce on large gradient tensors roughly 10x faster than a comparable L40S PCIe configuration.

Fine-Tuning 70B+ Models in BF16

Two 48GB L40S cards hold 96GB of VRAM total. A 70B model in BF16 is 140GB. Training adds optimizer states (Adam: 2x more VRAM) and gradients. Even with FP8 for forward pass, full fine-tuning of 70B requires more VRAM than two L40S cards can provide. Two 80GB A100s hold 160GB total, which is sufficient for LoRA fine-tuning of 70B at BF16 without spilling to CPU. For fine-tuning scenarios, rent A100 on Spheron with NVLink SXM4 configurations.

Legacy cuDNN Pipelines Tuned for Ampere

Some production deployments have Ampere-specific kernel optimizations in cuDNN, TensorRT, or custom CUDA code. Running those on Ada Lovelace usually works (CUDA backward compatibility), but you may not get the same performance without re-profiling. If you have heavily optimized A100 kernels and a tight delivery timeline, staying on A100 avoids a profiling and tuning cycle.

Mixed Training and Inference on One Node

A100 80GB handles both 70B LoRA fine-tuning and production BF16 inference without quantization. If your team runs weekly fine-tuning jobs and serves the updated model between runs on the same node, A100 is operationally simpler: no FP8 conversion pipeline, no checkpoint format management.

Migration Playbook: A100 to L40S

Step 1: Audit Your A100 Baseline

Profile your current serving stack before touching the infrastructure:

```bash
# Run vLLM benchmark on your existing A100 endpoint
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

# Record: total throughput (tok/s), p50/p95 TTFT, and actual GPU utilization
```

Note your current batch sizes under production load. If sustained batch size is consistently below 4, the A100's bandwidth advantage may actually be serving you better on TTFT than L40S FP8 would.

Step 2: Validate FP8 on Your Model

Check Hugging Face for an official FP8 checkpoint of your model. For Llama 3.1 70B, meta-llama/Llama-3.1-70B-Instruct-FP8 exists and has pre-calibrated scales. For models without an official FP8 variant:

```bash
# Dynamic FP8 quantization at load time (more accessible, slightly higher quality loss)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --dtype float16 \
  --port 8000
```

Compare perplexity on your evaluation set between the BF16 baseline and FP8 quantized version. Most instruction-tuned models show less than 5% perplexity increase with dynamic FP8.

Step 3: Estimate VRAM on L40S

Calculate model weight VRAM first:

```
FP8 VRAM (GB)  ≈ parameter_count_in_billions × 1
BF16 VRAM (GB) ≈ parameter_count_in_billions × 2
```

For 70B in FP8: 70GB across 2x 48GB L40S (total 96GB capacity). Remaining for KV cache: ~26GB.

For 32B in FP8: 32GB on a single 48GB L40S. Remaining for KV cache: ~16GB.

KV cache overhead per request:

```
kv_per_request = 2 × layers × kv_heads × head_dim × context_len × bytes_per_element
```

For Llama 3.1 70B FP8 KV at 4K context: ~0.63GB per concurrent request. With 26GB available: ~41 concurrent requests before KV cache exhaustion.
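
Putting the sizing steps together (an upper-bound estimate; vLLM's `--gpu-memory-utilization` cap and paged-KV block granularity lower the real ceiling):

```python
# Concurrency ceiling: VRAM left after weights, divided by per-request KV.
def max_concurrent(total_vram_gb, weights_gb, kv_per_request_gb):
    return int((total_vram_gb - weights_gb) / kv_per_request_gb)

# Llama 3.1 70B FP8 KV at 4K context: 2 * 80 layers * 8 KV heads * 128 dim
# * 4096 tokens * 1 byte.
kv_4k_fp8 = 2 * 80 * 8 * 128 * 4096 * 1 / 2**30   # ~0.63 GiB per request
print(max_concurrent(96, 70, kv_4k_fp8))           # ~41 on 2x 48GB L40S
```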

Run vLLM with --max-model-len set to your target context to see actual VRAM allocation:

```bash
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000
# Check startup logs for "GPU blocks allocated"
```

Step 4: Run Parallel Benchmarks

Test L40S and A100 with identical configs:

```bash
# A100 benchmark (BF16)
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

# L40S benchmark (FP8)
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --dtype float16 \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128
```

Check TTFT specifically at your production batch size, not just aggregate throughput. If your use case involves interactive responses (sub-500ms p95 TTFT), verify that L40S FP8 meets that SLO before migrating.

Step 5: Calculate Cost-Per-Token and Decide

Use the formula with your measured throughput:

```python
hourly_rate_l40s = 0.72    # per GPU, from Spheron API (04 May 2026)
hourly_rate_a100 = 1.64    # per GPU, A100 80GB SXM4 on-demand
num_gpus = 2

l40s_tps = 800    # replace with your measured benchmark
a100_tps = 500    # replace with your measured benchmark

cost_per_1m_l40s = (hourly_rate_l40s * num_gpus / l40s_tps) * 1_000_000 / 3600
cost_per_1m_a100 = (hourly_rate_a100 * num_gpus / a100_tps) * 1_000_000 / 3600

print(f"L40S on-demand: ${cost_per_1m_l40s:.2f}/1M tokens")
print(f"A100 on-demand: ${cost_per_1m_a100:.2f}/1M tokens")
```

If L40S cost-per-token is lower AND TTFT SLOs are met at your production batch size, migrate. If A100 context-length performance is critical or your batch sizes are consistently below 4, stay on A100 or use A100 spot for the cost advantage.
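
That decision rule can be written down directly. This is a sketch; the batch-4 threshold and the SLO handling are assumptions to tune against your own measurements:

```python
# Encode the migrate-or-stay rule from the paragraph above.
def choose_gpu(l40s_cost_per_1m, a100_cost_per_1m,
               l40s_ttft_p95_ms, ttft_slo_ms, sustained_batch):
    if sustained_batch < 4:
        return "A100"          # bandwidth-bound regime favors HBM
    if l40s_ttft_p95_ms > ttft_slo_ms:
        return "A100"          # L40S misses the latency SLO
    if l40s_cost_per_1m < a100_cost_per_1m:
        return "L40S"
    return "A100"

print(choose_gpu(0.50, 1.82, 1000, 2000, 8))   # L40S
print(choose_gpu(0.50, 1.82, 1000, 2000, 2))   # A100
```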

Renting L40S and A100 on Spheron

Spheron provides bare-metal on-demand access to both GPUs with per-minute billing. There are no long-term commitments, and you can mix GPU types within the same project: run A100 nodes for fine-tuning pipelines and L40S nodes for inference endpoints without any per-request serverless markup or model-loading overhead.

For L40S GPU rental, the current on-demand rate starts at $0.72/GPU/hr. For A100, on-demand starts at $1.64/GPU/hr or $0.45/GPU/hr on spot. A practical migration path: provision L40S on-demand for your inference fleet and keep A100 spot instances for fine-tuning jobs, paying the lower spot rate only when the training run is actually executing.

For batch time series forecasting with models like Chronos or TimesFM, the L40S 48GB can serve all model sizes simultaneously with headroom for large batch sizes. The 48 GB VRAM accommodates the full Chronos suite (Small through Large) plus a batch queue, making it a cost-effective single-instance setup. This is covered in depth in the time series foundation model GPU guide.

Check GPU pricing for current rates across all GPU models since marketplace prices shift with supply and demand.


Teams switching from A100 on-demand inference fleets to L40S typically cut cost-per-token by 70%+ on batch workloads at batch 16+. Spheron lets you provision both GPUs on-demand with no long-term commitments.

