The NVIDIA L40 and L40S use the same AD102 silicon, the same 48 GB GDDR6 ECC memory, and the same 864 GB/s bandwidth. Both GPUs support FP8 via 4th-gen Tensor Cores. The key difference for AI inference: the L40S has roughly 2x the FP8 throughput (733 TFLOPS dense vs 362 TFLOPS dense for the L40) and ships with NVIDIA's Transformer Engine for automatic FP8/FP16 recasting in frameworks like vLLM. That gap translates to 1.5-2x higher token throughput on quantized LLM workloads, a higher hourly rate on most providers, and a 50W increase in TDP.
If you're evaluating which Ada Lovelace GPU to use for 7B-34B inference workloads, the answer comes down to how much FP8 throughput you need and whether the Transformer Engine framework is worth the premium.
TL;DR Spec Comparison
| Specification | NVIDIA L40 | NVIDIA L40S |
|---|---|---|
| Architecture | Ada Lovelace (AD102) | Ada Lovelace (AD102) |
| VRAM | 48 GB GDDR6 ECC | 48 GB GDDR6 ECC |
| Memory Bandwidth | 864 GB/s | 864 GB/s |
| CUDA Cores | 18,176 | 18,176 |
| Tensor Cores | 4th gen | 4th gen |
| Transformer Engine | No | Yes |
| FP32 TFLOPS | 90.5 | 91.6 |
| FP16 TFLOPS (dense) | 181 | 362 |
| FP8 TFLOPS (dense) | 362 (724 sparse) | 733 (1,466 sparse) |
| TDP | 300 W | 350 W |
| PCIe Interface | Gen4 x16 | Gen4 x16 |
| NVLink | No | No |
| MIG Support | No | No |
| Intended Workload | Professional visualization + compute | Data center AI inference + compute |
| Spheron Cloud Price | $0.76/hr on-demand | $0.72/hr on-demand, $0.32/hr spot |
Cost-Per-Million-Tokens Comparison (Llama 3.1 8B)
| Model | GPU | Precision | Batch Size | Tokens/sec | $/hr | Cost/1M tokens |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | L40 (RunPod) | FP16 | 1 | ~35 | $0.34 | ~$2.70 |
| Llama 3.1 8B | L40 (RunPod) | FP16 | 8 | ~168 | $0.34 | ~$0.56 |
| Llama 3.1 8B | L40S (Spheron) | FP16 | 1 | ~46 | $0.72 | ~$4.35 |
| Llama 3.1 8B | L40S (Spheron) | FP16 | 8 | ~336 | $0.72 | ~$0.60 |
| Llama 3.1 8B | L40S (Spheron) | FP8 | 1 | ~70 | $0.72 | ~$2.86 |
| Llama 3.1 8B | L40S (Spheron) | FP8 | 8 | ~504 | $0.72 | ~$0.40 |
Pricing fluctuates based on GPU availability. The prices above are as of 05 May 2026 and may have changed. Check current GPU pricing for live rates.
Architecture: What the L40S Adds Over the L40
Both GPUs use the same AD102 die manufactured on TSMC's 4N process node. The L40S is NVIDIA's data center re-positioning of the Ada Lovelace architecture with the Transformer Engine enabled, the same hardware feature that ships on the H100. The L40 targets professional visualization, rendering, and virtual workstation workloads. The L40S was purpose-configured for data center AI inference.
In practice, this means the L40S uses the same silicon as the L40 with the Transformer Engine unlocked. It is the same die with a different firmware configuration and a higher power envelope.
FP8 Transformer Engine
Both GPUs support FP8 via their 4th-gen Tensor Cores. The L40 delivers 362 TFLOPS FP8 dense (724 with sparsity). The L40S delivers 733 TFLOPS FP8 dense (1,466 with sparsity), roughly twice the L40's figure.
The L40S ships with NVIDIA's branded Transformer Engine, which dynamically switches between FP8 and FP16 precision per layer during inference, based on per-tensor scaling factors computed at runtime. In vLLM, FP8 is enabled with --quantization fp8 (the --dtype flag only accepts 16-bit and 32-bit types), and no per-layer configuration is required. Precision selection is handled automatically, making FP8 deployment straightforward on the L40S.
The throughput impact is direct: FP8 doubles arithmetic density versus FP16 because each FP8 operation processes twice as many values per tensor core cycle. The L40S reaches 733 TFLOPS FP8 dense, versus 362 TFLOPS FP16 dense. With structured sparsity enabled, the FP8 figure reaches 1,466 TFLOPS.
In practical Llama 3.1 8B benchmarks, the L40S FP8 delivers roughly 1.5-1.8x the token throughput of the L40S FP16 at equivalent batch sizes. The L40S inference guide has a full breakdown of Transformer Engine behavior across vLLM workloads.
What the L40 Tensor Cores Support
The L40's 4th-gen Tensor Cores handle FP16, BF16, TF32, FP8, and INT8. Compared to the L40S, the L40 delivers roughly half the FP8 throughput (362 vs 733 TFLOPS dense) and lacks NVIDIA's Transformer Engine software framework for automatic FP8/FP16 layer recasting. The L40 can process FP8 workloads but at lower throughput and with more manual framework configuration compared to the L40S.
Memory and Bandwidth: Where They're Identical
Both GPUs have 48 GB GDDR6 ECC at 864 GB/s. The memory subsystem is unchanged between L40 and L40S.
For inference, bandwidth determines how quickly the GPU can load model weights from VRAM into compute units. At batch size 1-8 for 7B-13B models, 864 GB/s is sufficient for both GPUs. The performance difference at these batch sizes is almost entirely explained by the FP8 compute advantage on the L40S, not by bandwidth differences.
At higher batch sizes (16+), both GPUs become bandwidth-bound. At that point, the FP8 compute advantage of the L40S narrows relative to its theoretical TFLOPS ratio, because the bottleneck shifts to memory, where they are equal. This is why the L40S FP8 speedup over FP16 is measured at 1.5-1.8x in practice rather than the theoretical 2x implied by the compute ratio.
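As a rough sanity check on the bandwidth discussion above, the sketch below computes a bandwidth-only ceiling for single-stream decode, assuming every generated token has to stream the full set of weights from VRAM once. It ignores KV cache reads, kernel overhead, and compute limits, and it rounds Llama 3.1 8B to 8 billion parameters, so treat the result as an upper bound rather than a prediction.

```python
# Bandwidth-only ceiling for batch-1 decode: each generated token streams
# all model weights from VRAM once (KV cache traffic is ignored here).

BANDWIDTH_GB_S = 864.0  # L40 and L40S, from the spec table


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/sec if memory bandwidth were the only limit."""
    return bandwidth_gb_s / weight_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for label, nbytes in (("FP16", 2.0), ("FP8", 1.0)):
        ceiling = decode_ceiling_tok_s(8.0, nbytes)
        print(f"8B {label}: {weight_gb(8.0, nbytes):.0f} GB weights, "
              f"ceiling ~{ceiling:.0f} tok/s")
```

Under these assumptions the ceiling is about 54 tokens/sec at FP16 and 108 tokens/sec at FP8 on either card. The batch-1 figures in the benchmark table (~46 and ~70 tokens/sec on the L40S) sit below those bounds.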
Both GPUs carry 48 GB of GDDR6 with ECC. On GDDR6 cards, ECC is implemented inline, so with ECC enabled a portion of the memory is reserved for check bits and the usable capacity reported to applications is somewhat below 48 GB. The overhead is the same on both cards, so a model that fits on one fits on the other; leave headroom for KV cache and activations rather than sizing weights to the full 48 GB. For strategies to maximize KV cache efficiency and serve more concurrent requests within 48 GB, see the KV cache optimization guide.
FP8 Inference Benchmarks
These benchmarks use the latest stable vLLM release. Precision: --quantization fp8 on L40S FP8 runs, --dtype float16 on all FP16 runs. L40 FP8 benchmarks are not included here: while the hardware supports FP8, the L40S Transformer Engine framework makes FP8 the practical default for vLLM deployments on that GPU, and L40 FP8 production use is uncommon.
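To keep the two precision settings explicit, here is a minimal sketch of a helper that builds the `vllm serve` command line for each run. The helper name and the model ID are assumptions for illustration, not taken from the benchmark setup. It uses `--dtype float16` for FP16 runs and assumes vLLM's `--quantization fp8` option for FP8 runs, since `--dtype` only accepts 16-bit and 32-bit types.

```python
import shlex
from typing import List

# Assumed checkpoint for illustration; substitute the model you benchmark.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"


def vllm_serve_command(model: str, precision: str) -> List[str]:
    """Hypothetical helper: argv for an FP16 or FP8 `vllm serve` run."""
    cmd = ["vllm", "serve", model]
    if precision == "fp16":
        cmd += ["--dtype", "float16"]
    elif precision == "fp8":
        # Online FP8 quantization of a 16-bit checkpoint at load time.
        cmd += ["--quantization", "fp8"]
    else:
        raise ValueError(f"unsupported precision: {precision!r}")
    return cmd


if __name__ == "__main__":
    for precision in ("fp16", "fp8"):
        print(shlex.join(vllm_serve_command(MODEL_ID, precision)))
```

Returning an argv list rather than a string makes the command easy to pass to `subprocess.run` without shell quoting issues.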
| Model | GPU | Precision | Batch Size | Tokens/sec | $/hr | Cost/1M tokens |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | L40 (RunPod) | FP16 | 1 | ~35 | $0.34 | ~$2.70 |
| Llama 3.1 8B | L40 (RunPod) | FP16 | 8 | ~168 | $0.34 | ~$0.56 |
| Llama 3.1 8B | L40S (Spheron) | FP16 | 1 | ~46 | $0.72 | ~$4.35 |
| Llama 3.1 8B | L40S (Spheron) | FP16 | 8 | ~336 | $0.72 | ~$0.60 |
| Llama 3.1 8B | L40S (Spheron) | FP8 | 1 | ~70 | $0.72 | ~$2.86 |
| Llama 3.1 8B | L40S (Spheron) | FP8 | 8 | ~504 | $0.72 | ~$0.40 |
| Qwen 2.5 14B | L40S (Spheron) | FP8 | 4 | ~180 | $0.72 | ~$1.11 |
| Qwen 2.5 14B | L40S (Spheron) | FP8 | 8 | ~260 | $0.72 | ~$0.77 |
Llama 3.1 70B: neither GPU handles this model comfortably on a single card at FP16. At INT4 quantization (~35 GB weights), the L40S fits the model but leaves approximately 13 GB for KV cache, which severely limits useful batch size and context length. See the GPU cost-per-token benchmark guide for 70B multi-GPU configurations.
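The 70B budget above can be worked through with a short calculation. This is a sketch under stated assumptions: the layer count, KV head count, and head dimension are taken from the public Llama 3.1 70B configuration rather than from this article, the KV cache is assumed to be FP16, and activation memory and framework overhead are ignored.

```python
# Rough single-GPU memory budget for a 70B model at INT4 on a 48 GB card.

VRAM_GB = 48.0
PARAMS = 70e9
BYTES_PER_PARAM_INT4 = 0.5

N_LAYERS = 80     # assumed: transformer layers
N_KV_HEADS = 8    # assumed: grouped-query attention KV heads
HEAD_DIM = 128    # assumed: per-head dimension
KV_ELEM_BYTES = 2  # assumed: FP16 KV cache entries


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, elem_bytes: int) -> int:
    """KV cache bytes per token; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes


def max_cached_tokens(vram_gb: float, weight_gb: float,
                      per_token_bytes: int) -> int:
    """Tokens of KV cache that fit once the weights are loaded."""
    free_bytes = (vram_gb - weight_gb) * 1e9
    return int(free_bytes // per_token_bytes)


if __name__ == "__main__":
    w = weights_gb(PARAMS, BYTES_PER_PARAM_INT4)
    per_tok = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, KV_ELEM_BYTES)
    print(f"weights: {w:.0f} GB, KV cache: {per_tok / 1024:.0f} KiB/token")
    print(f"max cached tokens: {max_cached_tokens(VRAM_GB, w, per_tok)}")
```

With these inputs the remaining 13 GB holds roughly 39,600 tokens of KV cache in total. That pool is shared across all concurrent sequences, so four requests would get under 10,000 tokens of context each before any runtime overhead is counted.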
Benchmarks reflect approximate figures from vLLM deployments on Ada Lovelace hardware. Actual throughput varies with model, sequence length, KV cache utilization, and system configuration. L40 FP16 figures are estimated from hardware TFLOPS ratios and empirical L40S data. L40S FP8 figures are scaled at 1.5x from FP16 measurements, per Transformer Engine behavior.
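The Cost/1M tokens column follows directly from the hourly price and the throughput. The sketch below reproduces it for a few rows, assuming steady throughput for the full billed hour (no idle time, warm-up, or queueing). The function name is illustrative.

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


# (label, $/hr, tokens/sec) copied from the benchmark table above.
ROWS = [
    ("L40  FP16 batch 8", 0.34, 168),
    ("L40S FP16 batch 8", 0.72, 336),
    ("L40S FP8  batch 8", 0.72, 504),
]

if __name__ == "__main__":
    for label, price, tps in ROWS:
        cost = cost_per_million_tokens(price, tps)
        print(f"{label}: ${cost:.2f} per 1M tokens")
```

Plugging in your own measured tokens/sec and current hourly rate gives a like-for-like comparison when prices change.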
For a deeper look at how continuous batching and PagedAttention scheduling affect throughput at different batch sizes, see LLM serving optimization: continuous batching and paged attention.
To set up vLLM on Spheron, see the Spheron vLLM quick-start.
NVENC, NVDEC, and Video AI Workloads
Both the L40 and L40S ship identical Ada Lovelace encoder hardware: triple NVENC 8th gen with AV1 encode and triple NVDEC with AV1 decode. For pure video transcoding pipelines (ffmpeg, HandBrake, RTMP), the two GPUs are functionally equivalent. AV1 encode is available on both; the L40 does not lack it.
Where the L40S pulls ahead for video AI: generative video pipelines that run a diffusion transformer backbone before the encode step. Wan, HunyuanVideo, and FLUX.1 video all run inference on a transformer model to generate frames, then encode with NVENC. The inference step benefits from FP8. A HunyuanVideo generation job running the diffusion transformer at FP8 on the L40S will complete faster than the same job running FP16 on the L40. The NVENC encode phase that follows is identical on both GPUs.
For Stable Diffusion video (SVD, AnimateDiff) at standard FP16, the performance gap narrows. These models are more memory-bandwidth-bound during the UNet forward pass, and both GPUs share the same bandwidth. The practical difference for SVD workloads is smaller than the FP8 TFLOPS ratio suggests.
For power-constrained deployments where you only need NVENC capacity, the L40's 300W TDP is a material advantage. You get identical encode throughput at 50W less.
Training and Fine-Tuning
Neither GPU is suited for full pre-training or large-scale fine-tuning. PCIe-only multi-GPU interconnect (no NVLink), no MIG support, and the GDDR6 bandwidth ceiling all rule out cluster-scale training.
The L40S handles QLoRA fine-tuning of 7B-13B models on a single GPU. 48 GB GDDR6 fits a 13B 4-bit base model plus LoRA adapter weights and optimizer states with headroom. FP8 mixed-precision training via FSDP and Transformer Engine is technically possible but remains experimental in most training frameworks as of mid-2026.
The L40 supports FP8 via its Tensor Cores but lacks the Transformer Engine framework, making mixed-precision FP8 training less practical in most frameworks. For any fine-tuning workload that benefits from FP8 mixed precision, the L40S is the clear choice. For BF16/FP16 full fine-tuning of smaller models (3B-7B), the L40S has higher FP16 compute (362 vs 181 TFLOPS dense) and will complete forward and backward passes faster.
However, the overall training speed difference is smaller than the 2x TFLOPS ratio implies. Each gradient update is dominated by the Adam optimizer step, which reads and writes weights, gradients, and two moment vectors and is therefore memory-bandwidth-bound. Since both GPUs share identical 864 GB/s GDDR6 bandwidth, the compute advantage narrows in wall-clock training time compared to what you see in inference throughput benchmarks. The L40S is still the faster and more capable fine-tuning GPU, but the gap is less pronounced than in inference.
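To illustrate why the optimizer step does not benefit from extra TFLOPS, the sketch below estimates a bandwidth-only floor for one Adam update. It assumes an unfused update in which every parameter's weight, gradient, and two moments are read once and the weight and both moments are written back, with all tensors stored in BF16 (2 bytes). Real training stacks differ (FP32 master weights, fused or 8-bit optimizers), so the numbers are illustrative only.

```python
# Bandwidth-only floor for one Adam step under the assumptions above.

READS_PER_PARAM = 4   # weight, gradient, first moment, second moment
WRITES_PER_PARAM = 3  # updated weight, first moment, second moment


def adam_step_bytes(n_params: float, bytes_per_value: int = 2) -> float:
    """Total bytes moved through VRAM for one optimizer step."""
    return n_params * (READS_PER_PARAM + WRITES_PER_PARAM) * bytes_per_value


def adam_step_floor_seconds(n_params: float, bandwidth_gb_s: float = 864.0,
                            bytes_per_value: int = 2) -> float:
    """Minimum step time if memory bandwidth is the only limit."""
    return adam_step_bytes(n_params, bytes_per_value) / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    for n in (3e9, 7e9):
        ms = adam_step_floor_seconds(n) * 1000
        print(f"{n / 1e9:.0f}B params: >= {ms:.0f} ms per optimizer step")
```

Because the only hardware input is the 864 GB/s bandwidth figure, the floor is the same on the L40 and the L40S.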
Power and Form Factor
The L40 draws 300W TDP in a PCIe Gen4 x16 dual-slot form factor. The L40S draws 350W TDP in the same form factor. Both cards fit standard PCIe Gen4 x16 slots.
The 50W TDP difference matters in dense builds. A 4x GPU server with L40 cards draws 1,200W from GPUs; the same chassis with L40S cards draws 1,400W. At $0.12/kWh, that 200W difference adds approximately $17.28/month per server in electricity. Cloud providers price this cost into the hourly rate, which partly explains the L40S premium over secondary-market L40 pricing.
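The electricity figure above can be reproduced with a few lines. This sketch assumes the GPUs run at full TDP around the clock for a 30-day month, which overstates real draw for bursty workloads; the function name is illustrative.

```python
def monthly_power_cost_usd(extra_watts: float, usd_per_kwh: float,
                           hours: float = 24 * 30) -> float:
    """Cost of a sustained extra power draw over one 30-day month."""
    return extra_watts / 1000 * hours * usd_per_kwh


GPUS_PER_SERVER = 4
L40_TDP_W = 300
L40S_TDP_W = 350

if __name__ == "__main__":
    delta_w = GPUS_PER_SERVER * (L40S_TDP_W - L40_TDP_W)
    cost = monthly_power_cost_usd(delta_w, usd_per_kwh=0.12)
    print(f"extra draw: {delta_w} W, extra cost: ${cost:.2f}/month")
```

Changing `usd_per_kwh` or `GPUS_PER_SERVER` shows how the gap scales for other sites and chassis sizes; cooling overhead is not included.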
Cloud Pricing and Availability
The L40 PCIe is available on Spheron at $0.76/hr on-demand. The L40S is Spheron's mid-tier inference GPU, available on-demand at $0.72/hr with per-minute billing. L40S spot pricing is $0.32/hr, a 56% discount versus on-demand, suited to interruptible batch workloads.
| Provider | L40 $/hr | L40S $/hr | Notes |
|---|---|---|---|
| Spheron | $0.76 on-demand | $0.72 on-demand, $0.32 spot | Per-minute billing, no minimum |
| RunPod | ~$0.34 Community Cloud | ~$0.79 Community Cloud / $0.86 Secure Cloud | Community Cloud is shared infrastructure |
| Vast.ai | From ~$0.22 spot | From ~$0.31 spot | Variable availability, interruptible |
| Lambda | Not offered | Not offered | No Ada Lovelace consumer-class GPUs |
| AWS | Not offered | $1.86/hr (G6e instances) | Managed service overhead included |
For a broader multi-provider comparison including H100, A100, and B200, see the GPU cloud pricing comparison 2026.
Decision Matrix
| Scenario | Recommendation | Reason |
|---|---|---|
| FP8-quantized 7B-13B LLM inference at moderate concurrency | L40S | Transformer Engine delivers 1.5-2x throughput at same bandwidth |
| FP16 LLM inference where FP8 is unavailable | L40S still preferred | Higher FP16 TFLOPS (362 vs 181 dense) |
| Video transcoding only (ffmpeg, RTMP pipelines) | Either; L40 if power-constrained | Identical NVENC hardware |
| Generative video AI (Wan, HunyuanVideo, FLUX video) | L40S | Diffusion backbone benefits from FP8 inference |
| QLoRA fine-tuning of 7B-13B models | L40S | FP8 mixed-precision options; same VRAM budget |
| Power-constrained rack deployment (300W limit per slot) | L40 | 50W lower TDP, identical NVENC, FP8 not needed |
| High-concurrency inference (50+ concurrent requests) | H100 | Neither L40 nor L40S has HBM bandwidth for this regime; see best GPU for AI inference guide |
| Budget batch workloads, lower FP8 throughput acceptable | L40 on Spheron ($0.76/hr) or secondary market | Available on Spheron; also on Vast.ai/RunPod at lower spot prices |
The L40S is Spheron's mid-tier inference GPU: cheaper per hour than an H100, faster on FP8 than the L40, and available on-demand at per-minute billing with no contracts. It is the right starting point for most 7B-34B inference deployments before traffic justifies H100-class hardware.
