Choosing between L40S and H100 for inference is not as simple as comparing price-per-hour. On Spheron, H100 SXM5 at $3.84/hr costs 5.33x as much as L40S at $0.72/hr, while delivering roughly 3x more tokens per second at batch 1 and 3-5x more at batch 32+. Even at batch 32, that throughput advantage does not close the 5.33x price gap, so L40S wins on cost-per-token in most serving scenarios. H100 wins for 70B+ model inference on a single card, NVLink-required training, and long-context workloads.
This post works through the specs, FP8 throughput benchmarks for Llama 3.1 8B and Qwen3 14B, the cost-per-million-token math using live Spheron pricing, and a concrete decision framework for which GPU belongs in your stack. For operators comparing L40S against A100 instead, see the L40S vs A100 cost-per-token guide, which runs the same analysis against the Ampere architecture.
TL;DR: When L40S Beats H100
| Scenario | Winner | Why |
|---|---|---|
| 7B-30B LLM inference, low-medium traffic (under 50% GPU util) | L40S | $0.72/hr saves 81% vs H100 SXM5 on hourly spend |
| SDXL/Flux image generation | L40S | Similar images/hr on single card; L40S costs 5.33x less per hour |
| Embedding serving at scale | L40S | Low compute demand; no NVLink needed; lower fixed cost |
| Development, testing, prototyping | L40S | Capable FP8 GPU at 81% lower hourly rate |
| High-utilization LLM batch serving (batch 32+) | L40S | H100's 3-5x throughput does not overcome the 5.33x price gap; L40S still wins on cost-per-token |
| 70B+ model inference (single card) | H100 | 80GB HBM3 fits 70B FP8 on one GPU, avoiding the PCIe overhead of splitting the model across two L40S |
| Training and fine-tuning | H100 | NVLink 900 GB/s; HBM3 bandwidth for gradient ops |
| Long-context inference (32K+ tokens) | H100 | KV cache fits in 80GB; 3.9x bandwidth advantage |
The honest summary: L40S wins on cost-per-token in most LLM serving scenarios because the 5.33x price difference exceeds H100's 2.7x-5x throughput advantage. H100 wins for 70B+ model single-card inference, NVLink-required training, and long-context workloads.
Specs: L40S vs H100 Side by Side
| Metric | L40S | H100 SXM5 | H100 PCIe |
|---|---|---|---|
| Architecture | Ada Lovelace | Hopper (GH100) | Hopper (GH100) |
| VRAM | 48 GB GDDR6 | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | 864 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 TFLOPS (dense) | 733 | 1,979 | 1,513 |
| FP8 TFLOPS (w/ sparsity) | 1,466 | 3,958 | 3,026 |
| BF16 TFLOPS (dense) | 362 | 989 | 756 |
| BF16 TFLOPS (w/ sparsity) | 733 | 1,979 | 1,513 |
| NVLink | None (PCIe only) | 900 GB/s (NVSwitch) | 600 GB/s (optional NVLink bridge between GPU pairs) |
| TDP | 350 W | 700 W | 350 W |
| Transformer Engine | Yes (4th-gen TC) | Yes (4th-gen TC) | Yes (4th-gen TC) |
| MIG support | No | Yes (7 instances) | Yes (7 instances) |
| Spheron on-demand $/hr | $0.72 | $3.84 | $2.01 |
The bandwidth difference is the number that matters most for inference. At 864 GB/s (L40S) vs 3,350 GB/s (H100 SXM5), H100 moves model weights and KV cache 3.9x faster. This is the primary throughput driver at low batch sizes.
Architecture: Ada Lovelace vs Hopper
FP8: Same Feature, Different Ceiling
Both L40S and H100 support FP8 via NVIDIA's Transformer Engine. The engine dynamically selects FP8 or BF16 precision per layer at runtime, using per-tensor scaling to maintain accuracy. For practical inference with vLLM, both GPUs support the same --quantization fp8 flag and produce equivalent output quality.
The gap is throughput capacity. H100 SXM5 delivers 3,958 FP8 TFLOPS with sparsity. L40S delivers 1,466. That is 2.7x more FP8 compute per GPU. At batch sizes where the workload is compute-bound (roughly batch 16+ for 7B-13B models), this advantage directly translates into more tokens per second.
For a deep dive into H100's FP8 Tensor Core architecture, the H100 complete datasheet covers all precision tiers and the Transformer Engine implementation.
Memory Subsystem: GDDR6 vs HBM3
H100 SXM5 uses HBM3 (High Bandwidth Memory, 3rd generation) at 3,350 GB/s. L40S uses GDDR6 at 864 GB/s. The technology difference matters for two reasons.
First, bandwidth. HBM stacks DRAM dies directly on the GPU package, enabling the wide memory interface that delivers 3.35 TB/s. GDDR6 uses conventional PCB-mounted DRAM with a standard memory bus, topping out at 864 GB/s on L40S.
Second, capacity. HBM's die-stacking architecture enables larger capacities in the GPU thermal envelope. H100 SXM5 packs 80GB of HBM3 on the package. L40S carries 48GB of GDDR6. For 70B parameter models, this 32GB gap is the decisive factor: 70B FP8 (70GB) fits on a single H100 but not on a single L40S.
At low batch sizes (batch 1-4), inference is memory-bandwidth-bound. Compute utilization is low; the bottleneck is how fast weights and KV cache can be streamed from VRAM into on-chip SRAM for each forward pass. In this regime throughput tracks memory bandwidth, so L40S cannot close the gap opened by H100's 3.9x bandwidth advantage.
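To make the bandwidth-bound argument concrete, here is a rough ceiling calculation. It is a sketch under a simplifying assumption: at batch 1, every generated token requires streaming all model weights from VRAM once, and KV cache reads and kernel overhead are ignored. The 8 GB weight size (Llama 3.1 8B at one byte per parameter) and the function name are illustrative choices, not measured values.

```python
def decode_tok_per_s_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on batch-1 decode speed if each token streams all weights once."""
    return bandwidth_gb_s / weight_gb


# Assumption: an 8B model in FP8 holds roughly 8 GB of weights (1 byte per parameter).
WEIGHT_GB = 8.0

l40s_ceiling = decode_tok_per_s_ceiling(864, WEIGHT_GB)    # L40S: 864 GB/s
h100_ceiling = decode_tok_per_s_ceiling(3350, WEIGHT_GB)   # H100 SXM5: 3,350 GB/s

print(f"L40S ceiling: {l40s_ceiling:.0f} tok/s")
print(f"H100 ceiling: {h100_ceiling:.0f} tok/s")
print(f"ratio:        {h100_ceiling / l40s_ceiling:.2f}x")
```

The batch-1 estimates later in this post (~90 tok/s on L40S, ~280 tok/s on H100) sit below these ceilings, which is what you would expect once real-world overheads are included.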
NVLink: Only H100 Has It
H100 SXM5 supports NVLink 4.0 at 900 GB/s total bidirectional bandwidth per GPU via NVSwitch. L40S is PCIe-only. For multi-GPU inference where tensors need to be exchanged between cards (tensor parallelism), NVLink makes a real difference. PCIe 4.0 provides roughly 64 GB/s bidirectional on a 16-lane slot, about 14x less bandwidth than NVLink.
For tensor-parallel inference on 70B models (where you're splitting the model across 2+ GPUs), H100's NVLink dramatically reduces communication overhead. For training runs with all-reduce gradients, NVLink is close to required for efficient multi-GPU scaling. The L40S's PCIe limitation is not a problem for single-card inference or embedding serving, but it is a meaningful constraint for multi-GPU training.
Benchmark: Cost Per Token at Batch 1, 8, and 32
The throughput figures below are representative estimates. They are derived from community vLLM benchmarks and extrapolated from an L40S FP16 baseline for Llama 3.1 8B (46 tok/s at batch 1, 336 tok/s at batch 8), assuming FP8 roughly doubles throughput in memory-bound regimes and adds further gains in compute-bound regimes. Results will vary with vLLM version, prompt length distribution, and system configuration. Use these as directional guidance, not authoritative benchmarks.
Cost formula: cost_per_million = (hourly_rate / tok_per_second) * (1_000_000 / 3600)
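The formula is easy to get wrong by a factor of 3,600, so here is a minimal Python version of it. The function name and the optional `num_gpus` argument are illustrative additions; `tok_per_second` is assumed to be the aggregate output throughput of the whole deployment.

```python
def cost_per_million_tokens(hourly_rate: float, tok_per_second: float, num_gpus: int = 1) -> float:
    """Dollars per one million output tokens at a sustained throughput.

    hourly_rate is per GPU; tok_per_second is the total across all GPUs.
    """
    tokens_per_hour = tok_per_second * 3600
    return hourly_rate * num_gpus / tokens_per_hour * 1_000_000


# Batch-8 estimates for Llama 3.1 8B FP8 from the table in the next section.
print(round(cost_per_million_tokens(0.72, 650), 2))    # L40S
print(round(cost_per_million_tokens(3.84, 1600), 2))   # H100 SXM5
```

Swap in your own measured tokens per second and current hourly rates; the result is only as good as the throughput number you feed it.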
Llama 3.1 8B FP8
| Batch size | L40S tok/s | L40S $/M tokens | H100 SXM5 tok/s | H100 $/M tokens | Winner |
|---|---|---|---|---|---|
| 1 | ~90 | ~$2.22 | ~280 | ~$3.81 | L40S |
| 8 | ~650 | ~$0.31 | ~1,600 | ~$0.67 | L40S |
| 32 | ~1,500 | ~$0.13 | ~4,500 | ~$0.24 | L40S |
At batch 8, L40S at ~$0.31/M tokens is 2.16x more cost-efficient than H100 at ~$0.67/M tokens, despite H100's higher throughput. At batch 32, L40S at ~$0.13/M tokens still beats H100 at ~$0.24/M tokens. H100's 3x throughput advantage at batch 32 does not overcome the 5.33x price difference, so L40S wins on cost-per-token across all batch sizes measured here.
Qwen3 14B FP8
| Batch size | L40S tok/s | L40S $/M tokens | H100 SXM5 tok/s | H100 $/M tokens | Winner |
|---|---|---|---|---|---|
| 1 | ~40 | ~$5.00 | ~130 | ~$8.21 | L40S |
| 8 | ~290 | ~$0.69 | ~850 | ~$1.25 | L40S |
| 32 | ~800 | ~$0.25 | ~3,000 | ~$0.36 | L40S |
For 14B models, the bandwidth bottleneck is more pronounced at batch 1 and 8 because larger weights require more memory bandwidth per forward pass. L40S VRAM is sufficient (14B FP8 = ~14GB of weights, leaving 34GB for KV cache), and while the bandwidth gap means H100 delivers about 2.9x more throughput at batch 8, L40S at ~$0.69/M tokens still beats H100 at ~$1.25/M tokens because the 5.33x price difference is larger than the throughput gap.
The cost-per-token calculation here uses live GPU pricing from Spheron as of the date below. Check the pricing page for current rates before making infrastructure decisions.
Pricing fluctuates based on GPU availability. The prices above are as of 26 May 2026 and may have changed. Check current GPU pricing for live rates.
When L40S Wins
Low-Utilization Endpoints (7B-30B Models)
The strongest case for L40S is when your inference endpoint runs at moderate, bursty, or unpredictable traffic. You pay per GPU-hour, not per token. An H100 sitting idle at $3.84/hr is more expensive than an L40S sitting idle at $0.72/hr. Even at full utilization, L40S wins on cost-per-token for 7B-30B models at typical batch sizes, so lower utilization only widens the advantage further.
For a concrete example: an internal demo endpoint serving 30 requests/hour at an average of 200 tokens/response generates 6,000 output tokens/hour. Both GPUs handle this trivially; neither is anywhere near saturation. Monthly GPU cost: L40S at $0.72 × 720 hours = $518.40. H100 at $3.84 × 720 hours = $2,764.80. L40S saves $2,246.40/month.
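The arithmetic above can be sketched in a few lines of Python. It also shows the effective cost per million tokens at this traffic level, which is dominated by idle time rather than GPU speed. The function names are illustrative, and the 720-hour month is the same assumption used in the example.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, matching the example above


def monthly_cost(hourly_rate: float) -> float:
    """Cost of keeping one GPU running for the whole month."""
    return hourly_rate * HOURS_PER_MONTH


def effective_cost_per_million(hourly_rate: float, tokens_per_hour: float) -> float:
    """What a million tokens actually costs when the GPU is billed by the hour."""
    return hourly_rate / tokens_per_hour * 1_000_000


TOKENS_PER_HOUR = 30 * 200  # 30 requests/hour x 200 tokens/response

for name, rate in [("L40S", 0.72), ("H100 SXM5", 3.84)]:
    print(
        f"{name}: ${monthly_cost(rate):,.2f}/month, "
        f"${effective_cost_per_million(rate, TOKENS_PER_HOUR):,.2f} per million tokens"
    )
```

At 6,000 tokens/hour the effective cost per million tokens is far above the saturated-GPU figures in the benchmark tables on both cards, which is why the hourly rate matters more than throughput for lightly loaded endpoints.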
For a detailed guide on vLLM deployment on L40S, including configuration flags and container setup, see NVIDIA L40S for AI inference.
Image Generation (SDXL, Flux)
SDXL and Flux.1 diffusion workloads are compute-bound on the UNet/DiT backbone, not memory-bandwidth-bound in the same way as autoregressive LLMs. Both L40S and H100 complete a single SDXL inference pass in broadly similar wall-clock time, with H100 having some advantage from its higher compute density.
The practical difference is small enough that for single-tenant image generation endpoints, L40S generates images at a comparable rate to H100 at 5.33x lower cost per hour. The 48GB VRAM is sufficient for SDXL at any standard resolution and for Flux.1 up to 1080p. For parallel batch image generation at high concurrency, H100's compute advantage does scale up, but most image generation deployments are latency-focused rather than batch-throughput-focused.
Embedding Serving
Embedding models (BGE, GTE, sentence-transformers) are small relative to the L40S's 48GB capacity and computationally light. A single L40S can serve millions of embeddings per hour with very low GPU utilization. Running embedding endpoints on H100 is functional but wastes most of the GPU's capabilities.
For multi-modal embedding pipelines or retrieval-augmented generation (RAG) preprocessing, L40S handles the embedding workload at $0.72/hr vs $3.84/hr for an H100. The throughput per dollar for embeddings strongly favors L40S because neither GPU is the bottleneck; the CPU-side tokenization and batching typically are.
Multi-Tenant SaaS with Per-GPU Isolation
If you're building a SaaS product that gives tenants dedicated GPU access rather than shared inference, cost per GPU-hour is your primary metric. L40S at $0.72/hr lets you offer per-GPU dedicated instances at lower price points than H100.
For 7B-14B models, one L40S per tenant is sufficient for typical serving workloads. MIG partitioning is not available on L40S, so you cannot subdivide it. But for small models, you don't need MIG; the full 48GB card handles a 14B FP8 model with plenty of KV cache headroom.
When H100 Wins
70B+ Model Inference
70B FP8 weights total 70GB. H100 SXM5's 80GB HBM3 fits the model on a single card with 10GB remaining for KV cache at short context lengths. L40S at 48GB requires two GPUs (96GB total) for 70B FP8.
For 70B+ models, a single H100 at $3.84/hr avoids the PCIe tensor-parallel overhead that 2x L40S ($1.44/hr combined) incurs. While 2x L40S is cheaper on paper at $1.44/hr, the PCIe-limited communication between cards significantly cuts throughput for large model inference, making H100's single-card performance and NVLink scaling the better choice for 70B+ workloads.
Large Batch Serving: H100 Throughput Without Cost-Per-Token Win
At batch sizes of 32 and above, the workload transitions from memory-bandwidth-bound to compute-bound. H100's 3,958 FP8 TFLOPS (2.7x the L40S's 1,466) pull ahead in raw throughput, delivering roughly 3x more tokens per second at batch 32 for 7B-13B models.
With H100 at $3.84/hr and L40S at $0.72/hr, H100 would need 5.33x the throughput of L40S to break even on cost-per-token. A 3x throughput advantage does not clear that threshold. As the benchmark tables show, L40S wins on cost-per-million-tokens at batch 32 even for compute-bound workloads. For teams where throughput ceiling matters more than cost-per-token (hard latency SLOs, burst handling), H100's raw speed may still justify the higher rate.
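The break-even reasoning reduces to one comparison: the faster GPU wins on cost-per-token only if its speedup exceeds the price ratio. A minimal sketch, using the on-demand rates and batch-32 throughput estimates quoted in this post (the function names are illustrative):

```python
L40S_RATE = 0.72  # $/hr, on-demand rate quoted in this post
H100_RATE = 3.84  # $/hr, on-demand rate quoted in this post


def break_even_speedup(expensive_rate: float, cheap_rate: float) -> float:
    """Throughput multiple the pricier GPU needs to match cost-per-token."""
    return expensive_rate / cheap_rate


def h100_wins_on_cost(l40s_tok_s: float, h100_tok_s: float) -> bool:
    """True only if H100's speedup exceeds the break-even multiple."""
    return h100_tok_s / l40s_tok_s > break_even_speedup(H100_RATE, L40S_RATE)


# (model, L40S tok/s, H100 tok/s) at batch 32, from the benchmark tables.
rows = [("Llama 3.1 8B", 1500, 4500), ("Qwen3 14B", 800, 3000)]

print(f"break-even speedup: {break_even_speedup(H100_RATE, L40S_RATE):.2f}x")
for model, l40s, h100 in rows:
    winner = "H100" if h100_wins_on_cost(l40s, h100) else "L40S"
    print(f"{model}: H100 is {h100 / l40s:.2f}x faster, {winner} wins on cost-per-token")
```

If prices move, only the two rate constants change; at H100 spot pricing, for instance, the break-even multiple drops well below 5.33x and the conclusion can flip.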
Training and Fine-Tuning
NVLink is the deciding factor for multi-GPU training. H100 SXM5's NVLink at 900 GB/s total bidirectional bandwidth enables efficient all-reduce gradient synchronization across a training cluster. L40S's PCIe-only connectivity caps inter-GPU bandwidth at roughly 64 GB/s, about 14x less, so gradient synchronization takes correspondingly longer whenever communication is the bottleneck. For LoRA fine-tuning of 13B-70B models on 2-8 GPU setups, H100 multi-GPU training finishes significantly faster in wall-clock time than L40S multi-GPU training.
For single-GPU fine-tuning of small models (7B with LoRA or QLoRA), L40S is perfectly capable. But anything requiring multi-GPU gradient synchronization belongs on H100.
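For a feel of what the interconnect gap means per training step, here is a bandwidth-only estimate of a ring all-reduce. It is a rough sketch: it uses the standard result that each GPU moves 2(n-1)/n of the payload, plugs in the headline bandwidth figures quoted above as if they were fully usable, and ignores latency and compute/communication overlap. The 1 GB gradient payload is a hypothetical value chosen for illustration.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce of payload_gb across num_gpus."""
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb  # data each GPU moves
    return traffic_gb / link_gb_s


PAYLOAD_GB = 1.0  # hypothetical gradient size synchronized per step
NUM_GPUS = 8

nvlink_s = ring_allreduce_seconds(PAYLOAD_GB, NUM_GPUS, 900)  # H100 SXM5 NVLink
pcie_s = ring_allreduce_seconds(PAYLOAD_GB, NUM_GPUS, 64)     # L40S over PCIe 4.0 x16

print(f"NVLink: {nvlink_s * 1000:.2f} ms per all-reduce")
print(f"PCIe:   {pcie_s * 1000:.2f} ms per all-reduce")
print(f"ratio:  {pcie_s / nvlink_s:.1f}x")
```

Whether this matters depends on how much compute happens between synchronizations; small LoRA adapters synchronize far less data than full-parameter training.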
Long-Context Inference (32K+ Tokens)
KV cache size scales linearly with sequence length, layer count, and batch size. For a 13B model at 32K context with batch 8, KV cache alone can consume 20-30GB. Combined with model weights (13GB FP8), this pushes against L40S's 48GB ceiling and requires aggressive chunked-prefill or reduced batch size to prevent OOM.
H100's 80GB HBM3 handles large KV caches comfortably, and its 3.35 TB/s bandwidth processes the attention mechanism significantly faster at long contexts. For RAG pipelines with long retrieved contexts or conversational endpoints with extended history, H100 is the safer and faster choice. For long-context inference optimization, see the KV cache optimization guide.
Spheron Pricing: L40S vs H100
Live prices from Spheron as of 26 May 2026:
| GPU | On-demand $/hr | Spot $/hr | Notes |
|---|---|---|---|
| L40S (per GPU) | $0.72 | $1.07 | Available as 8-GPU bare metal nodes (~$5.76/hr); spot currently above on-demand (supply quirk) |
| H100 SXM5 (per GPU) | $3.84 | $1.69 | Single-GPU instances available |
| H100 PCIe (per GPU) | $2.01 | N/A | Available in multi-GPU bundles |
For comparison, hyperscaler H100 rates:
| GPU | Provider | On-demand $/hr/GPU |
|---|---|---|
| H100 SXM5 | AWS p5.48xlarge | ~$12.29 (8x H100, $98.32/hr) |
| H100 SXM5 | Azure ND96isr H100 v5 | ~$12.36 (8x H100, $98.88/hr) |
| H100 SXM5 | Spheron | $3.84 |
| L40S | Spheron | $0.72 |
Spheron aggregates compute from 5+ providers into a single marketplace, which is how it offers rates below hyperscaler prices. Both GPUs are available on-demand with per-minute billing and SSH root access.
For batch inference workloads under 30B parameters at moderate utilization, L40S GPU rental at $0.72/hr provides capable FP8 inference at the lowest available price point. Teams running 70B+ inference or multi-GPU training workloads should rent H100 on Spheron where on-demand single-GPU availability makes it straightforward to right-size the allocation.
Migration Patterns
Start on L40S, Add H100 When Needed
The practical approach for teams building inference infrastructure: start on L40S for development and lower-traffic serving, then add H100 nodes when throughput demands grow. L40S and H100 instances on Spheron are provisioned independently, so you can run a mixed fleet without migrating the entire deployment.
A common pattern is routing by model size at the load balancer: small and mid-size models (7B-13B) stay on L40S, while 70B+ requests route to H100. This keeps cost-per-hour low for the majority of traffic while handling large model requests on the appropriate hardware.
The L40S vs A100 comparison has a step-by-step migration playbook that also applies when moving from L40S to H100.
Mixed Fleet Strategy
For teams serving multiple model sizes, a mixed fleet lets you right-size per workload:
- L40S nodes: embedding models, SDXL/Flux image generation, 7B-13B LLM serving at moderate traffic
- H100 nodes: 70B+ model inference, high-throughput batch serving, fine-tuning jobs
Route at the load balancer by model ID and estimated batch size. For most teams, this hybrid approach delivers lower total infrastructure cost than running everything on H100 while still having H100 capacity available for workloads that justify it.
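A routing rule like this can be sketched as a small pure function. This is a simplified illustration, not a production load balancer: the `Request` type, pool names, and `pick_pool` function are hypothetical, and the thresholds (70B parameters, 32K context tokens) are taken from the guidance in this post. Batch-size-aware routing is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    model_id: str
    params_billions: float  # size of the model the request targets
    context_tokens: int     # prompt plus expected output length


def pick_pool(req: Request, large_model_b: float = 70, long_context: int = 32_000) -> str:
    """Return the GPU pool for a request: 'h100' for large models or long contexts."""
    if req.params_billions >= large_model_b:
        return "h100"  # 70B+ does not fit on a single 48GB L40S
    if req.context_tokens >= long_context:
        return "h100"  # KV cache pressure at 32K+ context
    return "l40s"      # default: the cheaper card


requests = [
    Request("llama-3.1-8b", 8, 4_096),
    Request("qwen3-14b", 14, 40_000),
    Request("llama-3.1-70b", 70, 2_048),
]
for r in requests:
    print(f"{r.model_id} -> {pick_pool(r)}")
```

Keeping the rule as a pure function makes it easy to unit-test and to tune the thresholds as measured utilization data comes in.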
Decision Checklist
Before choosing, work through these questions:
- Model size: is your primary model 70B+ parameters? If yes, H100 on a single card avoids PCIe tensor-parallel overhead that 2x L40S incurs.
- Batch size: what is your P50 batch size? Below 16 means you're memory-bandwidth-bound, where the L40S/H100 gap tracks the bandwidth ratio (3.9x). At 32 and above you're compute-bound, where the gap tracks the 2.7x FP8 TFLOPS ratio. Even so, L40S wins on cost-per-token in both regimes because the 5.33x price difference exceeds either gap.
- Context length: are you serving prompts longer than 16K tokens at batch 8+? If yes, H100's 80GB handles KV cache headroom L40S lacks.
- GPU utilization: what is your average GPU utilization over a 24-hour window? Below 40-50% means L40S saves heavily on absolute spend, but even at full utilization L40S wins on cost-per-token for 7B-30B models.
- Training vs inference: is this a serving workload only? For fine-tuning on multi-GPU setups, H100's NVLink is decisive.
- NVLink requirement: do you need tensor-parallel or pipeline-parallel training across multiple GPUs? L40S has no NVLink.
- VRAM headroom: does your model plus KV cache at target context length and batch size fit in 48GB? Use the KV cache formula in the Quick Setup Guide to verify.
- Spot tolerance: can your workload tolerate interruptions? Both H100 SXM5 (spot $1.69/hr) and L40S (spot $1.07/hr) have spot pricing on Spheron. Note that L40S spot is currently priced above L40S on-demand ($0.72/hr), which is unusual and likely reflects temporary supply conditions. For H100, spot at $1.69/hr is significantly below on-demand at $3.84/hr, making H100 spot an attractive option for interruption-tolerant workloads.
If your inference workload runs 7B-30B models at moderate traffic where GPU utilization stays below 50%, L40S on Spheron cuts your hourly GPU spend by 81% compared to H100 SXM5 on-demand. For 70B+ models, or when raw throughput matters more than cost-per-token, H100's speed and single-card capacity win out. Check live availability and pricing for both.
Quick Setup Guide
Step 1. Check whether your model has an FP8 checkpoint on Hugging Face (look for -fp8 or -FP8 in the model name). If not, vLLM's --quantization fp8 flag applies dynamic FP8 at load time. For models up to 30B in FP8, a single L40S 48GB card is sufficient. For 70B FP8, you need two L40S cards (2 * 48GB = 96GB, versus 70GB of weights at 1 byte per parameter). For 70B BF16 (140GB of weights), you need two H100 SXM5 cards.
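The sizing rules above follow from a simple weights-only calculation, sketched below. It deliberately ignores KV cache, activations, and framework overhead, so treat the result as a lower bound on the GPU count rather than a guarantee of fit. The function names are illustrative.

```python
import math


def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: 1 byte/param for FP8, 2 for BF16."""
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float, bytes_per_param: float, vram_gb: float) -> int:
    """Fewest GPUs whose combined VRAM holds the weights (no KV cache headroom)."""
    return math.ceil(weight_gb(params_billions, bytes_per_param) / vram_gb)


print(min_gpus_for_weights(30, 1, 48))  # 30B FP8 on L40S
print(min_gpus_for_weights(70, 1, 48))  # 70B FP8 on L40S
print(min_gpus_for_weights(70, 1, 80))  # 70B FP8 on H100
print(min_gpus_for_weights(70, 2, 80))  # 70B BF16 on H100
```

In practice you also need room for KV cache, so a configuration that only just fits the weights will limit context length and batch size.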
Step 2. Query your inference logs for actual batch sizes at P50 and P95. If P50 batch is below 16, L40S is potentially more cost-efficient on an hourly basis. Run vLLM's benchmark_throughput.py with --num-prompts 200 --input-len 512 --output-len 128 at batch sizes 1, 8, and 32 to measure tokens/sec on each GPU. Then compute (hourly_rate / tokens_per_second) * 1_000_000 / 3600 to get cost per million output tokens.
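Step 3. KV cache per token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element; multiply by context length and batch size for the total. With grouped-query attention (GQA), num_kv_heads is smaller than the number of attention heads, so the cache shrinks proportionally. For a 13B model at FP8 with 128 tokens of context per request and batch 32: roughly 1.5-2GB of KV cache, even without GQA. This fits easily on L40S 48GB alongside the 13GB of model weights. At 8K context with batch 32, KV cache grows to roughly 15-20GB for a GQA model with 8 KV heads, still fitting on L40S for 13B; without GQA it is several times larger. At 32K context, L40S VRAM becomes a constraint and H100 80GB is safer.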
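Here is the KV cache formula as a small calculator. The model dimensions (40 layers, 40 heads, head dimension 128) are assumed values typical of a 13B-class model with full multi-head attention, not figures from a specific checkpoint; read the real values from your model's config. The parameter is named `num_kv_heads` because models using grouped-query attention store fewer KV heads than attention heads.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_element: int) -> int:
    """Bytes of K and V stored per token across all layers (the factor 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                bytes_per_element: int, context_len: int, batch_size: int) -> float:
    """Total KV cache in GB when every sequence in the batch is context_len long."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element)
    return per_token * context_len * batch_size / 1e9


# Assumed 13B-class dimensions, FP8 KV cache (1 byte per element).
LAYERS, HEADS, HEAD_DIM, FP8_BYTES = 40, 40, 128, 1

print(kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM, FP8_BYTES))          # bytes per token
print(round(kv_cache_gb(LAYERS, HEADS, HEAD_DIM, FP8_BYTES, 128, 32), 2))    # 128 ctx, batch 32
print(round(kv_cache_gb(LAYERS, 8, HEAD_DIM, FP8_BYTES, 128, 32), 2))        # same, 8 KV heads
```

Add the result to the weight footprint and compare against 48GB or 80GB, leaving a few GB of margin for activations and framework overhead.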
Step 4. Check live pricing at spheron.network/pricing/ or the API at app.spheron.ai/api/gpu-offers. L40S on Spheron is available as bare metal 8-GPU nodes and also has spot pricing at $1.07/hr per GPU (note: L40S spot is currently above L40S on-demand at $0.72/hr, an unusual supply condition). H100 SXM5 on Spheron offers both on-demand ($3.84/hr) and spot ($1.69/hr). For persistent serving endpoints, compare L40S on-demand vs H100 on-demand.
Step 5. Formula: cost_per_million = (hourly_rate * num_gpus / tokens_per_second) * 1_000_000 / 3600, where tokens_per_second is the combined throughput of all GPUs in the deployment. Plug in your measured tokens/sec from the benchmark run in Step 2 and live Spheron prices. For 7B-30B model serving at typical batch sizes, L40S wins on cost-per-token because the 5.33x price difference outweighs H100's 2.7x-5x throughput advantage. H100 is the better choice only when single-card capacity or NVLink requirements favor it.
Frequently Asked Questions
L40S is not faster than H100 for inference. H100 SXM5 has 3.35 TB/s memory bandwidth vs L40S's 864 GB/s, so H100 generates more tokens per second on all model sizes. At batch 1 for Llama 3.1 8B FP8, H100 SXM5 delivers roughly 3x higher throughput than L40S. At batch 32, H100's compute advantage (3,958 vs 1,466 FP8 TFLOPS with sparsity) sustains that lead. The trade-off is price: H100 SXM5 on Spheron costs $3.84/hr vs L40S at $0.72/hr, so L40S saves roughly 81% on hourly GPU spend at on-demand rates.
Use L40S when: running image generation (SDXL, Flux) where per-image throughput is similar on both GPUs; serving embedding models at moderate scale; running low-to-medium traffic endpoints on 7B-30B models where GPU utilization stays below 50%; development and testing where you don't need H100 throughput but still need a capable inference GPU. Use H100 when model size exceeds 70B, context length is above 32K tokens, or you're running training and fine-tuning jobs requiring NVLink.
L40S delivers 864 GB/s over GDDR6. H100 SXM5 delivers 3,350 GB/s over HBM3. That is a 3.9x bandwidth advantage for H100. Bandwidth is the primary bottleneck at low batch sizes (batch 1-4), meaning H100 generates roughly 3-4x more tokens per second per GPU for single-stream requests. At batch 32+ the workload becomes more compute-bound, and H100's 2.7x FP8 TFLOPS advantage (3,958 vs 1,466 with sparsity) matters more.
Using Spheron on-demand pricing (26 May 2026) and the FP8 throughput estimates for Llama 3.1 8B at batch 8: L40S at $0.72/hr delivering approximately 650 tok/s works out to roughly $0.31 per million output tokens. H100 SXM5 at $3.84/hr delivering approximately 1,600 tok/s works out to roughly $0.67 per million output tokens. L40S is roughly 2.16x more cost-efficient at batch 8 for this model. At batch 32, L40S at ~$0.13/M tokens still beats H100 at ~$0.24/M tokens. For 7B-13B model serving, L40S wins on cost-per-token across all common batch sizes.
Both GPUs support FP8 natively. L40S (Ada Lovelace) and H100 (Hopper) both have 4th-gen Tensor Cores and NVIDIA's Transformer Engine. The key difference is throughput ceiling: L40S delivers 1,466 FP8 TFLOPS with sparsity vs H100 SXM5's 3,958 TFLOPS. vLLM, SGLang, and TensorRT-LLM all support FP8 on both GPUs.
