Choosing between L40S and A100 for LLM inference comes down to one question: does FP8's compute advantage on Ada Lovelace overcome the A100's superior memory bandwidth? The answer depends on your batch size, context length, and whether you're comparing on-demand to on-demand or factoring in spot availability.
This post works through the specs, vLLM benchmarks for Llama 3.1 70B and Qwen3 32B, the cost-per-million-token math using live Spheron pricing, and a migration playbook if you decide to move from A100 to L40S.
TL;DR: L40S vs A100 at a Glance
| Metric | L40S | A100 40GB | A100 80GB |
|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | Ampere |
| VRAM | 48GB GDDR6 | 40GB HBM2e | 80GB HBM2e |
| Memory Bandwidth | 864 GB/s | 1,555 GB/s | 2,039 GB/s |
| FP8 Tensor Cores | Yes (4th gen) | No | No |
| BF16 TFLOPS | 362 | 312 | 312 |
| FP8 TFLOPS | 733 | N/A | N/A |
| NVLink | None (PCIe only) | NVLink 3.0 | NVLink 3.0 |
| Interconnect | PCIe 4.0 | PCIe 4.0 / SXM | PCIe 4.0 / SXM |
| TDP | 350W | 400W (SXM4) | 400W (SXM4) |
| Best for | Batch inference (FP8) | Fine-tuning, NLP | Long-context, training |
The key tradeoff: L40S wins inference cost-per-token at batch 8+ because FP8 delivers 733 TFLOPS (2.35x A100's BF16), with 1-byte weights that halve memory pressure. A100 wins on memory-bandwidth-bound workloads (long context beyond 32K tokens, low-batch single-stream serving) and multi-GPU training via NVLink.
Architecture: Ada Lovelace vs Ampere
FP8 Tensor Cores
The most consequential hardware difference is FP8 support. L40S ships with 4th-generation Tensor Cores that execute FP8 (E4M3 and E5M2) natively. A100 uses 3rd-generation Tensor Cores that support TF32, FP16, BF16, INT8, and INT4, but not FP8.
In practical terms: loading a 70B model in FP8 requires 70GB (1 byte per parameter), which fits on two 48GB L40S cards. The same model in BF16 is 140GB, also requiring two 80GB A100s. FP8 on L40S delivers 733 TFLOPS vs A100's 312 BF16 TFLOPS. At compute-dominated batch sizes (batch 8+), that 2.35x compute ratio translates directly to higher throughput.
To activate FP8 on L40S, use --quantization fp8 in vLLM. Dynamic FP8 quantization at load time typically shows 3-5% accuracy degradation on most instruction-tuned models. Pre-quantized FP8 checkpoints of Llama-3.1-70B-Instruct on Hugging Face ship with pre-calibrated scales and show less degradation.
Memory Type: GDDR6 vs HBM2e
L40S uses GDDR6 at 864 GB/s. A100 80GB uses HBM2e at 2,039 GB/s. That 2.35x bandwidth difference is the primary reason A100 wins at low batch sizes where inference is memory-bandwidth-bound.
In the decode phase, every generated token requires reading the entire model weight tensor once. For a 70B FP8 model (~70GB), the L40S takes about 81ms to load all weights per step (70GB / 864 GB/s). The A100 in BF16 reads 140GB of weights, which takes about 69ms (140GB / 2,039 GB/s). At batch 1, the GPU spends most of its time doing these weight reads, and A100's bandwidth advantage is real though smaller than a naive byte-count comparison suggests. At batch 8+, the time is dominated by compute (the weight matrix multiply), and L40S's FP8 TFLOPS advantage takes over.
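A quick way to sanity-check those per-step numbers is to compute the weight-read time directly from the spec-sheet bandwidth. The sketch below uses the peak figures quoted above; real kernels reach only a fraction of peak bandwidth, so treat the results as optimistic lower bounds.

```python
# Rough decode-step weight-read time: one pass over all model weights per token.
# Bandwidth values are the spec-sheet peaks quoted above (assumed, not measured).
PARAMS_BILLIONS = 70  # Llama 3.1 70B

def weight_read_ms(bytes_per_param: float, bandwidth_gb_s: float) -> float:
    weight_gb = PARAMS_BILLIONS * bytes_per_param   # approximate model size in GB
    return weight_gb / bandwidth_gb_s * 1000        # seconds -> milliseconds

print(f"L40S, FP8 weights : {weight_read_ms(1.0, 864):.0f} ms/step")    # ~81 ms
print(f"A100, BF16 weights: {weight_read_ms(2.0, 2039):.0f} ms/step")   # ~69 ms
```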
NVLink and Multi-GPU Interconnect
A100 SXM4 uses NVLink 3.0 at 600 GB/s total bidirectional bandwidth. L40S has no NVLink. Multi-card L40S configurations communicate via PCIe 4.0, roughly 32 GB/s per direction (64 GB/s bidirectional) per x16 slot. For tensor-parallel inference (TP=2), PCIe is sufficient: the all-reduce traffic between two cards is small relative to the computation. For training all-reduce on large models, PCIe is a real bottleneck compared to NVLink.
The practical impact: if you're running inference only (not training), the NVLink absence on L40S is a minor issue for TP=2. For TP=4 or TP=8 inference at scale, NVLink-connected A100 configurations will maintain lower inter-GPU latency.
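To see why PCIe is tolerable at TP=2, it helps to estimate the all-reduce traffic per decode step. The sketch below assumes a Megatron-style tensor-parallel split (two all-reduces per transformer layer), Llama 3.1 70B's published layer count and hidden size, BF16 activations, and an assumed ~25 GB/s effective PCIe 4.0 x16 rate; it ignores per-collective launch latency.

```python
# Estimate tensor-parallel all-reduce traffic per decode step for Llama 3.1 70B
# at TP=2. Assumptions: 2 all-reduces per layer (Megatron-style), BF16
# activations, ~25 GB/s effective PCIe 4.0 x16 throughput.
layers, hidden, batch, dtype_bytes = 80, 8192, 8, 2
allreduces_per_layer = 2  # one after attention, one after the MLP block

payload_mb = layers * allreduces_per_layer * batch * hidden * dtype_bytes / 1e6
transfer_ms = payload_mb / 1000 / 25 * 1000  # MB -> GB, divide by GB/s, -> ms

print(f"all-reduce payload per decode step: {payload_mb:.0f} MB")   # ~21 MB
print(f"rough PCIe transfer time:           {transfer_ms:.2f} ms")  # ~0.8 ms
```

Under a millisecond of transfer time against decode steps measured in tens of milliseconds is why the missing NVLink rarely hurts TP=2 inference; collective launch overhead adds to this, but not enough to change the conclusion.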
NVENC and Display
L40S includes NVENC (hardware video encoding) and a display engine; A100 omits both. This matters only for workloads that mix inference with video transcoding or rendering. For pure LLM serving, these blocks sit idle and the difference is irrelevant.
Full Specifications
| Spec | L40S | A100 SXM4 (40GB) | A100 SXM4 (80GB) | A100 PCIe (80GB) |
|---|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | Ampere | Ampere |
| CUDA Cores | 18,176 | 6,912 | 6,912 | 6,912 |
| VRAM | 48GB GDDR6 | 40GB HBM2e | 80GB HBM2e | 80GB HBM2e |
| Memory BW | 864 GB/s | 1,555 GB/s | 2,039 GB/s | 1,935 GB/s |
| FP8 TFLOPS | 733 | N/A | N/A | N/A |
| FP16 TFLOPS | 362 | 312 | 312 | 312 |
| BF16 TFLOPS | 362 | 312 | 312 | 312 |
| INT8 TOPS | 733 | 624 | 624 | 624 |
| TDP | 350W | 400W | 400W | 300W |
| NVLink BW | N/A | 600 GB/s | 600 GB/s | N/A |
| PCIe | Gen 4.0 | Gen 4.0 | Gen 4.0 | Gen 4.0 |
| ECC | Yes | Yes | Yes | Yes |
| NVENC | Yes | No | No | No |
The A100 PCIe 80GB has slightly lower memory bandwidth (1,935 GB/s) than the SXM4 variant (2,039 GB/s) due to the different board design. For inference without NVLink, A100 PCIe is a close comparison point to L40S, and the bandwidth ratio drops to 2.24x (still significant).
Inference Benchmarks on vLLM
The benchmarks below are approximate community estimates based on vLLM 0.4.x with tensor-parallel-size 2 for 70B models, 512-token inputs, and 128-token outputs. Numbers vary by vLLM version, driver, and server configuration; treat them as directional estimates rather than first-party measurements.
Llama 3.1 70B: 2x L40S FP8 vs 2x A100 80GB BF16
This comparison requires 2 GPUs per configuration. L40S FP8 uses --quantization fp8 with sm89 support (CUDA 11.8+, Ada Lovelace). A100 uses BF16 (no FP8 support on Ampere).
| GPU Config | Quantization | Batch Size | Throughput (tok/s) | TTFT p50 (ms) |
|---|---|---|---|---|
| 2x L40S | FP8 | 1 | ~90 | ~900 |
| 2x L40S | FP8 | 8 | ~800 | ~1,000 |
| 2x L40S | FP8 | 32 | ~2,600 | ~1,500 |
| 2x A100 80GB SXM4 | BF16 | 1 | ~170 | ~580 |
| 2x A100 80GB SXM4 | BF16 | 8 | ~500 | ~720 |
| 2x A100 80GB SXM4 | BF16 | 32 | ~1,400 | ~1,000 |
Approximate community benchmarks. Actual throughput varies by vLLM version (0.4+), CUDA driver, and hardware configuration. Reference: vLLM GitHub benchmarking discussions and community reports.
At batch 1 (single-stream serving), A100 BF16 leads by roughly 1.9x due to its higher memory bandwidth. Every generated token requires reading the full set of model weights plus the KV cache, and the A100's 2,039 GB/s handles this faster than L40S's 864 GB/s even accounting for FP8's smaller memory footprint.
The crossover happens around batch 4-8. By batch 8, FP8 compute on the L40S (~800 tok/s) overtakes A100 BF16 (~500 tok/s) by about 1.6x. At batch 32, the L40S FP8 advantage grows to roughly 1.85x as the workload becomes increasingly compute-bound.
A critical constraint: FP8 on L40S requires CUDA sm89 (Ada Lovelace) and a driver stack supporting CUDA 11.8+. A100 (sm80) has no hardware FP8. If you launch vLLM with --quantization fp8 on an A100, it either errors out or falls back to a weight-only FP8 path that still computes in BF16/FP16, depending on the vLLM version; either way you do not get Ada's FP8 tensor-core throughput.
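Before launching with --quantization fp8, it is worth confirming the card actually reports compute capability 8.9. A minimal check using PyTorch (already a vLLM dependency):

```python
import torch

# Ada Lovelace (L40S) reports compute capability (8, 9); Ampere (A100) reports (8, 0).
# Hardware FP8 tensor cores require sm89 or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: sm{major}{minor}")

if (major, minor) >= (8, 9):
    print("Hardware FP8 available")
else:
    print("No hardware FP8 on this GPU; expect vLLM to fall back or error")
```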
Qwen3 32B: Single L40S vs Single A100 80GB
For a 32B model, FP8 weights are 32GB, which fits on a single 48GB L40S. BF16 weights are 64GB, which fits on a single 80GB A100. This is a meaningful difference: you can run 32B inference on one L40S where you need a full 80GB A100 for BF16.
| GPU | Quantization | Batch Size | Throughput (tok/s) | TTFT p50 (ms) |
|---|---|---|---|---|
| L40S | FP8 | 1 | ~220 | ~320 |
| L40S | FP8 | 8 | ~1,500 | ~380 |
| A100 80GB | BF16 | 1 | ~300 | ~220 |
| A100 80GB | BF16 | 8 | ~870 | ~310 |
Approximate estimates. A100 80GB BF16 32B benchmarks based on community vLLM reports. L40S FP8 32B based on interpolation from published Ada Lovelace FP8 throughput ratios.
At batch 1, A100 leads due to its higher memory bandwidth. The 64GB BF16 weight tensor reads faster over HBM2e (2,039 GB/s) than the 32GB FP8 tensor over GDDR6 (864 GB/s), since the bandwidth ratio (2.35x) exceeds the FP8 size advantage (2x). TTFT is noticeably lower on A100 for single-stream interactive serving.
At batch 8, the compute-to-bandwidth ratio shifts. L40S FP8's 733 TFLOPS pushes through the batched matrix multiplies faster than A100's 312 BF16 TFLOPS, flipping the result. L40S FP8 pulls ahead at batch 8, though the margin is narrower than for 70B models.
Note: these figures are approximate. Actual results depend on sequence lengths, vLLM version, and driver configuration. Run your own benchmarks before migrating production workloads.
Memory and Batch Sizing
Bandwidth-Bound vs Compute-Bound Inference
LLM inference has two regimes. In the decode phase (generating one token at a time), the GPU must read all model weights once per token. When batch size is low (1-4), this weight read dominates total execution time and throughput scales with memory bandwidth. In this regime, A100 80GB's 2,039 GB/s gives it a clear advantage over L40S's 864 GB/s.
At larger batch sizes (8+), the GPU processes multiple requests simultaneously. The weight read is amortized across the batch, and compute throughput becomes the bottleneck. Here, L40S FP8's 733 TFLOPS significantly outperforms A100's 312 BF16 TFLOPS.
The crossover point depends on model size. For 70B with TP=2:
| Batch Size | Throughput Bottleneck | L40S FP8 vs A100 BF16 |
|---|---|---|
| 1 | Memory bandwidth (weight reads) | A100 wins (~1.9x faster) |
| 4 | Mixed | A100 slightly ahead |
| 8 | Mixed-to-compute | L40S slightly ahead |
| 16 | Compute | L40S ahead (~1.7x) |
| 32 | Compute | L40S ahead (~1.85x) |
Long-Context Inference and KV Cache
At context lengths above 4K, KV cache size grows and becomes a significant portion of the VRAM budget. The KV cache per request formula:
kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dim) at 32K context in BF16: approximately 10.0GB per request. In FP8 KV cache, this drops to 5.0GB per request.
| Config | Model Weights | KV Cache per request (32K BF16) | Concurrent requests |
|---|---|---|---|
| 2x A100 80GB (BF16 weights) | ~140GB total | ~10.0GB | ~2 |
| 2x L40S (FP8 weights) | ~70GB total | ~5.0GB (FP8 KV) | ~5 |
At long contexts, the A100's HBM bandwidth matters more: reading 10.0GB of KV cache per decode step at 2,039 GB/s takes ~4.9ms, vs L40S at 864 GB/s taking ~11.6ms for the same 10.0GB (or ~5.8ms for FP8 KV at 5.0GB). For latency-sensitive long-context serving above 16K tokens, the A100 delivers substantially lower per-token latency.
For workloads with context lengths consistently below 8K tokens, this bandwidth gap is a secondary concern. For RAG pipelines, legal document analysis, or code review at 32K+ context, stick with A100 80GB.
Cost-Per-Million-Tokens Analysis
Using live Spheron pricing from 04 May 2026 and approximate community benchmarks for Llama 3.1 70B at batch 8 with tensor-parallel-size 2:
cost_per_1M_output_tokens = (hourly_rate_per_gpu * num_gpus / throughput_tokens_per_sec) * 1_000_000 / 3600

| Config | GPUs | $/hr/GPU | Total $/hr | Throughput (tok/s, batch 8) | $/1M tokens |
|---|---|---|---|---|---|
| L40S FP8 (on-demand) | 2 | $0.72 | $1.44 | ~800 | ~$0.50 |
| A100 80GB SXM4 BF16 (on-demand) | 2 | $1.64 | $3.28 | ~500 | ~$1.82 |
| A100 80GB SXM4 BF16 (spot) | 2 | $0.45 | $0.90 | ~500 | ~$0.50 |
At batch 8, L40S on-demand FP8 (~$0.50/1M tokens) is about 73% cheaper than A100 SXM4 on-demand BF16 (~$1.82/1M tokens). The gap widens at batch 32 where L40S throughput grows to ~2,600 tok/s vs A100's ~1,400 tok/s:
| Config | Total $/hr | Throughput (tok/s, batch 32) | $/1M tokens |
|---|---|---|---|
| L40S FP8 (on-demand) | $1.44 | ~2,600 | ~$0.15 |
| A100 80GB SXM4 BF16 (on-demand) | $3.28 | ~1,400 | ~$0.65 |
| A100 80GB SXM4 BF16 (spot) | $0.90 | ~1,400 | ~$0.18 |
At batch 32, L40S on-demand ($0.15/1M) beats A100 on-demand ($0.65/1M) by ~77%. A100 spot ($0.18/1M) remains competitive with L40S on-demand when spot availability allows.
The practical split: if you can tolerate preemption risk, A100 spot (~$0.50/1M at batch 8) ties L40S on-demand. If you need dedicated on-demand capacity, L40S FP8 beats A100 on-demand by roughly 3.6x at batch 8, and by more at batch 32, for Llama 3.1 70B.
Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing for live rates.
When A100 Still Wins
Long-Context Inference Past 32K Tokens
A100 80GB's 2,039 GB/s HBM bandwidth handles the KV cache reads at large contexts significantly faster than L40S's 864 GB/s. At 64K context, a single 70B request accumulates ~21.5GB of BF16 KV cache, all of which is read on every decode step. A100 reads this in ~10.5ms; L40S needs ~25ms with a BF16 KV cache. For interactive long-context use cases (legal review, code analysis, document summarization), the A100's latency advantage is felt by end users.
Multi-GPU Training with NVLink
A100 SXM4 connects to peer GPUs via NVLink 3.0 at 600 GB/s bidirectional. L40S has no NVLink. For distributed fine-tuning with tensor parallelism or ZeRO-3, the gradient synchronization overhead on L40S (PCIe at 64 GB/s) adds meaningful latency. An 8x A100 NVLink cluster runs all-reduce on large gradient tensors roughly 10x faster than a comparable L40S PCIe configuration.
Fine-Tuning 70B+ Models in BF16
Two 48GB L40S cards hold 96GB of VRAM total. A 70B model in BF16 is 140GB before you add gradients and optimizer states (Adam keeps two extra states per trainable parameter). Even with FP8 weights for the forward pass, full fine-tuning of 70B requires more VRAM than two L40S cards can provide. Two 80GB A100s hold 160GB total, which is sufficient for LoRA fine-tuning of 70B at BF16 without spilling to CPU. For fine-tuning scenarios, rent A100 on Spheron with NVLink SXM4 configurations.
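The arithmetic behind that claim, sketched with standard mixed-precision Adam bookkeeping (BF16 weights and gradients plus FP32 first and second moments); activation memory is workload-dependent and excluded, so these are floor estimates:

```python
# Rough VRAM floors for full fine-tuning vs. LoRA on a 70B model.
# Assumes BF16 weights and gradients plus FP32 Adam states (m and v);
# activations are excluded, so real usage is higher.
params_b = 70

full_ft_gb = params_b * (2 + 2 + 4 + 4)  # weights + grads + Adam m + Adam v = 12 B/param
lora_base_gb = params_b * 2              # frozen BF16 base weights only
# LoRA adapter weights, gradients, and optimizer states typically add only a few GB.

print(f"full fine-tune floor: ~{full_ft_gb} GB")   # ~840 GB: beyond any 2-GPU node
print(f"LoRA frozen base:     ~{lora_base_gb} GB") # ~140 GB: fits 2x A100 80GB, not 2x L40S
```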
Legacy cuDNN Pipelines Tuned for Ampere
Some production deployments have Ampere-specific kernel optimizations in cuDNN, TensorRT, or custom CUDA code. Running those on Ada Lovelace usually works (CUDA backward compatibility), but you may not get the same performance without re-profiling. If you have heavily optimized A100 kernels and a tight delivery timeline, staying on A100 avoids a profiling and tuning cycle.
Mixed Training and Inference on One Node
A100 80GB handles both 70B LoRA fine-tuning and production BF16 inference without quantization. If your team runs weekly fine-tuning jobs and serves the updated model between runs on the same node, A100 is operationally simpler: no FP8 conversion pipeline, no checkpoint format management.
Migration Playbook: A100 to L40S
Step 1: Audit Your A100 Baseline
Profile your current serving stack before touching the infrastructure:
# Run vLLM benchmark on your existing A100 endpoint
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--num-prompts 1000 \
--input-len 512 \
--output-len 128
# Record: total throughput (tok/s), p50/p95 TTFT, and actual GPU utilization

Note your current batch sizes under production load. If sustained batch size is consistently below 4, the A100's bandwidth advantage may actually be serving you better on TTFT than L40S FP8 would.
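One way to measure sustained batch size without guessing is to sample vLLM's Prometheus metrics endpoint during peak hours. The sketch below assumes the server exposes /metrics and the vllm:num_requests_running gauge; the metric name can differ between vLLM versions, so check your server's /metrics output first.

```python
# Sample the number of in-flight requests from a vLLM server every 10 seconds.
# Assumes the Prometheus endpoint at /metrics and (on recent versions) the
# vllm:num_requests_running gauge; adjust the metric name to match your version.
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # hypothetical endpoint

while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        if line.startswith("vllm:num_requests_running"):
            print(line)
    time.sleep(10)
```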
Step 2: Validate FP8 on Your Model
Check Hugging Face for a pre-quantized FP8 checkpoint of your model. For Llama 3.1 70B Instruct, FP8 builds with pre-calibrated scales are published on the Hub. For models without an FP8 variant:
# Dynamic FP8 quantization at load time (more accessible, slightly higher quality loss)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8 \
--dtype float16 \
--port 8000

Compare perplexity on your evaluation set between the BF16 baseline and FP8 quantized version. Most instruction-tuned models show less than 5% perplexity increase with dynamic FP8.
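A lighter-weight sanity check, before or alongside a full perplexity run, is to send identical prompts to the BF16 and FP8 endpoints and compare completions side by side. A minimal sketch using the openai client against vLLM's OpenAI-compatible server; the hostnames, prompt list, and model name are placeholders for your own setup.

```python
# Spot-check BF16 vs FP8 completions on identical prompts.
# Both servers are assumed to run vLLM's OpenAI-compatible API; hostnames,
# model name, and prompts below are placeholders.
from openai import OpenAI

endpoints = {
    "a100-bf16": "http://a100-host:8000/v1",
    "l40s-fp8": "http://l40s-host:8000/v1",
}
prompts = ["Summarize the tradeoffs between GDDR6 and HBM2e for LLM inference."]

for name, base_url in endpoints.items():
    client = OpenAI(base_url=base_url, api_key="EMPTY")  # vLLM ignores the key
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=128,
        )
        print(f"[{name}] {resp.choices[0].message.content[:200]}")
```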
Step 3: Estimate VRAM on L40S
Calculate model weight VRAM first:
FP8 VRAM = parameter_count (billions) × 1 GB per billion params
BF16 VRAM = parameter_count (billions) × 2 GB per billion params

For 70B in FP8: 70GB across 2x 48GB L40S (total 96GB capacity). Remaining for KV cache: ~26GB.
For 32B in FP8: 32GB on a single 48GB L40S. Remaining for KV cache: ~16GB.
KV cache overhead per request:
kv_per_request = 2 × layers × kv_heads × head_dim × context_len × bytes_per_element

For Llama 3.1 70B FP8 KV at 4K context: ~0.63GB per concurrent request. With 26GB available: ~41 concurrent requests before KV cache exhaustion.
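A small helper to reproduce those numbers for your own model and target context; the layer, head, and dimension values below are Llama 3.1 70B's, so swap in your model's config values.

```python
# Per-request KV cache size and a rough concurrency ceiling.
# Values are Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128);
# bytes_per_element is 1 for an FP8 KV cache, 2 for BF16.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_element):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_element / 1024**3

per_request_gb = kv_cache_gb(80, 8, 128, 4096, 1)  # ~0.63 GB at 4K context, FP8 KV
free_for_kv_gb = 26  # 96 GB across 2x L40S minus ~70 GB of FP8 weights

print(f"KV cache per request:    {per_request_gb:.2f} GB")
print(f"max concurrent requests: {int(free_for_kv_gb / per_request_gb)}")  # ~41
```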
Run vLLM with --max-model-len set to your target context to see actual VRAM allocation:
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--port 8000
# Check startup logs for the reported GPU KV cache block allocation

Step 4: Run Parallel Benchmarks
Test L40S and A100 with identical configs:
# A100 benchmark (BF16)
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--num-prompts 1000 \
--input-len 512 \
--output-len 128
# L40S benchmark (FP8)
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8 \
--dtype float16 \
--num-prompts 1000 \
--input-len 512 \
--output-len 128

Check TTFT specifically at your production batch size, not just aggregate throughput. If your use case involves interactive responses (sub-500ms p95 TTFT), verify that L40S FP8 meets that SLO before migrating.
Step 5: Calculate Cost-Per-Token and Decide
Use the formula with your measured throughput:
hourly_rate_l40s = 0.72 # per GPU, from Spheron API (04 May 2026)
hourly_rate_a100 = 1.64 # per GPU, A100 80GB SXM4 on-demand
num_gpus = 2
l40s_tps = 800 # replace with your measured benchmark
a100_tps = 500 # replace with your measured benchmark
cost_per_1m_l40s = (hourly_rate_l40s * num_gpus / l40s_tps) * 1_000_000 / 3600
cost_per_1m_a100 = (hourly_rate_a100 * num_gpus / a100_tps) * 1_000_000 / 3600
print(f"L40S on-demand: ${cost_per_1m_l40s:.2f}/1M tokens")
print(f"A100 on-demand: ${cost_per_1m_a100:.2f}/1M tokens")If L40S cost-per-token is lower AND TTFT SLOs are met at your production batch size, migrate. If A100 context-length performance is critical or your batch sizes are consistently below 4, stay on A100 or use A100 spot for the cost advantage.
Renting L40S and A100 on Spheron
Spheron provides bare-metal on-demand access to both GPUs with per-minute billing. There are no long-term commitments, and you can mix GPU types within the same project: run A100 nodes for fine-tuning pipelines and L40S nodes for inference endpoints without any per-request serverless markup or model-loading overhead.
For L40S GPU rental, the current on-demand rate starts at $0.72/GPU/hr. For A100, on-demand starts at $1.64/GPU/hr or $0.45/GPU/hr on spot. A practical migration path: provision L40S on-demand for your inference fleet and keep A100 spot instances for fine-tuning jobs, paying the lower spot rate only when the training run is actually executing.
For batch time series forecasting with models like Chronos or TimesFM, the L40S 48GB can serve all model sizes simultaneously with headroom for large batch sizes. The 48 GB VRAM accommodates the full Chronos suite (Small through Large) plus a batch queue, making it a cost-effective single-instance setup. This is covered in depth in the time series foundation model GPU guide.
Check GPU pricing for current rates across all GPU models since marketplace prices shift with supply and demand.
Teams switching from A100 on-demand inference fleets to L40S typically cut cost-per-token by 70%+ on batch workloads at batch 16+. Spheron lets you provision both GPUs on-demand with no long-term commitments.
