NVIDIA L40 vs L40S: Inference GPU Comparison, FP8 Performance, and Cloud Pricing (2026)

The NVIDIA L40 and L40S use the same AD102 silicon, the same 48 GB of GDDR6 ECC memory, and the same 864 GB/s bandwidth. Both GPUs support FP8 via 4th-gen Tensor Cores. The key difference for AI inference is that the L40S has roughly 2x the FP8 throughput (733 TFLOPS dense vs 362 TFLOPS dense for the L40) and ships with NVIDIA's Transformer Engine for automatic FP8/FP16 recasting in frameworks like vLLM. That gap translates to 1.5-2x higher token throughput on quantized LLM workloads, at the cost of a 50W increase in TDP and, on most providers, a higher hourly rate.

If you're evaluating which Ada Lovelace GPU to use for 7B-34B inference workloads, the answer comes down to how much FP8 throughput you need and whether the Transformer Engine framework is worth the premium.

TL;DR Spec Comparison

Specification	NVIDIA L40	NVIDIA L40S
Architecture	Ada Lovelace (AD102)	Ada Lovelace (AD102)
VRAM	48 GB GDDR6 ECC	48 GB GDDR6 ECC
Memory Bandwidth	864 GB/s	864 GB/s
CUDA Cores	18,176	18,176
Tensor Cores	4th gen	4th gen
Transformer Engine	No	Yes
FP32 TFLOPS	90.5	91.6
FP16 TFLOPS (dense)	181	362
FP8 TFLOPS (dense)	362 (724 sparse)	733 (1,466 sparse)
TDP	300 W	350 W
PCIe Interface	Gen4 x16	Gen4 x16
NVLink	No	No
MIG Support	No	No
Intended Workload	Professional visualization + compute	Data center AI inference + compute
Spheron Cloud Price	$0.76/hr on-demand	$0.72/hr on-demand, $0.32/hr spot

Cost-Per-Million-Tokens Comparison (Llama 3.1 8B)

Model	GPU	Precision	Batch Size	Tokens/sec	$/hr	Cost/1M tokens
Llama 3.1 8B	L40 (Runpod)	FP16	1	~35	$0.34	~$2.70
Llama 3.1 8B	L40 (Runpod)	FP16	8	~168	$0.34	~$0.56
Llama 3.1 8B	L40S (Spheron)	FP16	1	~46	$0.72	~$4.35
Llama 3.1 8B	L40S (Spheron)	FP16	8	~336	$0.72	~$0.60
Llama 3.1 8B	L40S (Spheron)	FP8	1	~70	$0.72	~$2.86
Llama 3.1 8B	L40S (Spheron)	FP8	8	~504	$0.72	~$0.40

Pricing fluctuates with GPU availability. The prices above are as of 05 May 2026 and may have changed. Check current GPU pricing for live rates.

Architecture: What the L40S Adds Over the L40

Both GPUs use the same AD102 die manufactured on TSMC's 4N process node. The L40S is NVIDIA's data center re-positioning of the Ada Lovelace architecture with the Transformer Engine enabled, the same hardware feature that ships on the H100. The L40 targets professional visualization and workstation use under NVIDIA's RTX Workstation line. The L40S was purpose-configured for data center AI inference.

In practice, this means the L40S uses the same silicon as the L40 with the Transformer Engine unlocked: the same die, with a different firmware configuration and thermal envelope.

FP8 Transformer Engine

Both GPUs support FP8 via their 4th-gen Tensor Cores. The L40 delivers 362 TFLOPS FP8 dense (724 with sparsity). The L40S delivers 733 TFLOPS FP8 dense (1,466 with sparsity), roughly twice the L40's figure.

The L40S ships with NVIDIA's branded Transformer Engine, which dynamically switches between FP8 and FP16 precision per layer during inference, based on per-tensor scaling factors computed at runtime. No manual per-layer configuration is required in vLLM once FP8 is enabled with --quantization fp8 (vLLM's --dtype option accepts only 16-bit and 32-bit types). The engine handles the precision selection automatically, making FP8 deployment straightforward on the L40S.

The throughput impact is direct: FP8 doubles arithmetic density versus FP16 because each FP8 operation processes twice as many values per tensor core cycle. The L40S reaches 733 TFLOPS FP8 dense, versus 362 TFLOPS FP16 dense. With structured sparsity enabled, the FP8 figure reaches 1,466 TFLOPS.

In practical Llama 3.1 8B benchmarks, the L40S at FP8 delivers roughly 1.5-1.8x the token throughput of the L40S at FP16 at equivalent batch sizes. The L40S inference guide has a full breakdown of Transformer Engine behavior across vLLM workloads.

What the L40 Tensor Cores Support

The L40's 4th-gen Tensor Cores handle FP16, BF16, TF32, FP8, and INT8. Compared to the L40S, the L40 delivers roughly half the FP8 throughput (362 vs 733 TFLOPS dense) and lacks NVIDIA's Transformer Engine software framework for automatic FP8/FP16 layer recasting. The L40 can process FP8 workloads but at lower throughput and with more manual framework configuration compared to the L40S.

Memory and Bandwidth: Where They're Identical

Both GPUs have 48 GB GDDR6 ECC at 864 GB/s. The memory subsystem is unchanged between L40 and L40S.

For inference, bandwidth determines how quickly the GPU can load model weights from VRAM into compute units. At batch size 1-8 for 7B-13B models, 864 GB/s is sufficient for both GPUs. The performance difference at these batch sizes is almost entirely explained by the FP8 compute advantage on the L40S, not by bandwidth differences.
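As a rough illustration of the weight-loading point above, the sketch below computes a bandwidth ceiling for single-stream decoding: if every weight is read from VRAM once per generated token, tokens/sec cannot exceed bandwidth divided by total weight size. The round 8-billion parameter count and the one-read-per-token model are simplifying assumptions, not measurements from this article, and KV cache traffic is ignored.

```python
def decode_ceiling_tokens_per_sec(params: float, bytes_per_param: float,
                                  bandwidth_gb_s: float = 864.0) -> float:
    """Upper bound on batch-1 decode rate from weight reads alone."""
    weight_bytes = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

PARAMS_8B = 8e9  # assumed round parameter count for an 8B model

fp16_ceiling = decode_ceiling_tokens_per_sec(PARAMS_8B, 2.0)  # 2 bytes/param
fp8_ceiling = decode_ceiling_tokens_per_sec(PARAMS_8B, 1.0)   # 1 byte/param

print(f"FP16 ceiling: {fp16_ceiling:.0f} tokens/sec")
print(f"FP8 ceiling:  {fp8_ceiling:.0f} tokens/sec")
```

Both ceilings apply equally to the L40 and the L40S, since the two cards share the same 864 GB/s memory subsystem.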

At higher batch sizes (16+), both GPUs become bandwidth-bound. At that point, the FP8 compute advantage of the L40S narrows relative to its theoretical TFLOPS ratio, because the bottleneck shifts to memory, where they are equal. This is why the L40S FP8 speedup over FP16 is measured at 1.5-1.8x in practice rather than the theoretical 2x implied by the compute ratio.

Both GPUs expose the full 48 GB GDDR6 ECC to GPU applications. GDDR6 ECC uses dedicated parity bits stored in separate memory cells, so the 48 GB capacity figure is not reduced. There is no VRAM capacity overhead from ECC. Models in the 43-48 GB range fit on either card. For strategies to maximize KV cache efficiency and serve more concurrent requests within 48 GB, see the KV cache optimization guide.

FP8 Inference Benchmarks

These benchmarks use the latest stable vLLM release. Precision: --quantization fp8 on L40S FP8 runs, --dtype float16 on all FP16 runs. L40 FP8 benchmarks are not included here: while the hardware supports FP8, the L40S Transformer Engine framework makes FP8 the practical default for vLLM deployments on that GPU, and L40 FP8 production use is uncommon.
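The two precision configurations can be sketched as command lines. The helper below only assembles an argument list for `vllm serve` and does not launch anything. The function name and model identifier are illustrative, and it assumes your vLLM version accepts `--quantization fp8`, which should be checked against your installed release.

```python
import shlex

def build_vllm_command(model: str, fp8: bool, max_model_len: int = 8192) -> list[str]:
    """Assemble (but do not run) a `vllm serve` command line."""
    cmd = ["vllm", "serve", model, "--max-model-len", str(max_model_len)]
    if fp8:
        # FP8 weight quantization; assumed to be supported by your vLLM version.
        cmd += ["--quantization", "fp8"]
    else:
        cmd += ["--dtype", "float16"]
    return cmd

# Example model identifier, used only for illustration.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

fp8_cmd = build_vllm_command(MODEL, fp8=True)
fp16_cmd = build_vllm_command(MODEL, fp8=False)

print(shlex.join(fp8_cmd))
print(shlex.join(fp16_cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without shell quoting issues.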

Model	GPU	Precision	Batch Size	Tokens/sec	$/hr	Cost/1M tokens
Llama 3.1 8B	L40 (Runpod)	FP16	1	~35	$0.34	~$2.70
Llama 3.1 8B	L40 (Runpod)	FP16	8	~168	$0.34	~$0.56
Llama 3.1 8B	L40S (Spheron)	FP16	1	~46	$0.72	~$4.35
Llama 3.1 8B	L40S (Spheron)	FP16	8	~336	$0.72	~$0.60
Llama 3.1 8B	L40S (Spheron)	FP8	1	~70	$0.72	~$2.86
Llama 3.1 8B	L40S (Spheron)	FP8	8	~504	$0.72	~$0.40
Qwen 2.5 14B	L40S (Spheron)	FP8	4	~180	$0.72	~$1.11
Qwen 2.5 14B	L40S (Spheron)	FP8	8	~260	$0.72	~$0.77

Llama 3.1 70B: neither GPU handles this model comfortably on a single card at FP16. At INT4 quantization (~35 GB weights), the L40S fits the model but leaves approximately 13 GB for KV cache, which severely limits useful batch size and context length. See the GPU cost-per-token benchmark guide for 70B multi-GPU configurations.
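The 70B memory budget above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: 0.5 bytes per parameter for INT4 weights, an FP16 KV cache, and the commonly published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128), none of which are stated in this article. It ignores activations, quantization scales, and runtime overhead, so real headroom will be smaller.

```python
VRAM_BYTES = 48 * 10**9          # 48 GB card, decimal gigabytes
PARAMS_70B = 70 * 10**9          # assumed round parameter count

# INT4 weights: half a byte per parameter.
weight_bytes = PARAMS_70B // 2
kv_budget_bytes = VRAM_BYTES - weight_bytes

# Assumed model shape (taken from the published model config, not this article).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_DTYPE_BYTES = 2               # FP16 KV cache

# Keys and values are both cached, hence the leading factor of 2.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES
max_cached_tokens = kv_budget_bytes // kv_bytes_per_token

print(f"Weights:       {weight_bytes / 1e9:.0f} GB")
print(f"KV budget:     {kv_budget_bytes / 1e9:.0f} GB")
print(f"KV per token:  {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Max KV tokens: {max_cached_tokens:,}")
```

Under these assumptions the cache holds just under 40,000 tokens in total, shared across all concurrent requests, which is why batch size and context length trade off so sharply on a single card.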

Benchmarks reflect approximate figures from vLLM deployments on Ada Lovelace hardware. Actual throughput varies with model, sequence length, KV cache utilization, and system configuration. L40 FP16 figures are estimated from hardware TFLOPS ratios and empirical L40S data. L40S FP8 figures are scaled at 1.5x from FP16 measurements, per Transformer Engine behavior.
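The cost column in the benchmark table follows directly from throughput and hourly price. A minimal sketch that reproduces a few rows, using only figures from the table; the function name is illustrative.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# (label, $/hr, tokens/sec) taken from the benchmark table.
rows = [
    ("L40 FP16, batch 8", 0.34, 168),
    ("L40S FP16, batch 8", 0.72, 336),
    ("L40S FP8, batch 8", 0.72, 504),
]

for label, price, tps in rows:
    print(f"{label}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Swapping in a measured tokens/sec figure and a live hourly rate gives a like-for-like comparison across providers.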

For a deeper look at how continuous batching and PagedAttention scheduling affect throughput at different batch sizes, see LLM serving optimization: continuous batching and paged attention.

To set up vLLM on Spheron, see the Spheron vLLM quick-start.

NVENC, NVDEC, and Video AI Workloads

Both the L40 and L40S ship identical Ada Lovelace encoder hardware: triple NVENC 8th gen with AV1 encode and triple NVDEC with AV1 decode. For pure video transcoding pipelines (ffmpeg, Handbrake, RTMP), the two GPUs are functionally equivalent. The L40 does not lack AV1 encode. Both GPUs support it.

Where the L40S pulls ahead for video AI: generative video pipelines that run a diffusion transformer backbone before the encode step. Wan, HunyuanVideo, and FLUX.1 video all run inference on a transformer model to generate frames, then encode with NVENC. The inference step benefits from FP8. A HunyuanVideo generation job running the diffusion transformer at FP8 on the L40S will complete faster than the same job running FP16 on the L40. The NVENC encode phase that follows is identical on both GPUs.

For Stable Diffusion video (SVD, AnimateDiff) at standard FP16, the performance gap narrows. These models are more memory-bandwidth-bound during the UNet forward pass, and both GPUs share the same bandwidth. The practical difference for SVD workloads is smaller than the FP8 TFLOPS ratio suggests.

For power-constrained deployments where you only need NVENC capacity, the L40's 300W TDP is a material advantage. You get identical encode throughput at 50W less.

Training and Fine-Tuning

Neither GPU is suited for full pre-training or large-scale fine-tuning. PCIe-only multi-GPU interconnect (no NVLink), no MIG support, and the GDDR6 bandwidth ceiling all rule out cluster-scale training.

The L40S handles QLoRA fine-tuning of 7B-13B models on a single GPU. 48 GB GDDR6 fits a 13B 4-bit base model plus LoRA adapter weights and optimizer states with headroom. FP8 mixed-precision training via FSDP and Transformer Engine is technically possible but remains experimental in most training frameworks as of mid-2026.

The L40 supports FP8 via its Tensor Cores but lacks the Transformer Engine framework, making mixed-precision FP8 training less practical in most frameworks. For any fine-tuning workload that benefits from FP8 mixed precision, the L40S is the clear choice. For BF16/FP16 full fine-tuning of smaller models (3B-7B), the L40S has higher FP16 compute (362 vs 181 TFLOPS dense) and will complete forward and backward passes faster. However, the overall training speed difference is smaller than the 2x TFLOPS ratio implies: Adam optimizer steps dominate each gradient update by reading and writing weights, gradients, and two moment vectors, making that phase memory-bandwidth-bound. Since both GPUs share identical 864 GB/s GDDR6 bandwidth, the compute advantage narrows in wall-clock training time compared to what you see in inference throughput benchmarks. The L40S is still the faster and more capable fine-tuning GPU, but the gap is less pronounced than in inference.

Power and Form Factor

The L40 draws 300W TDP in a PCIe Gen4 x16 dual-slot form factor. The L40S draws 350W TDP in the same form factor. Both cards fit standard PCIe Gen4 x16 slots.

The 50W TDP difference matters in dense builds. A 4x GPU server with L40 cards draws 1,200W from GPUs; the same chassis with L40S cards draws 1,400W. At $0.12/kWh, that 200W difference adds approximately $17.28/month per server in electricity. Cloud providers price this cost into the hourly rate, which partly explains the L40S premium over secondary-market L40 pricing.
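The electricity figure can be reproduced as below. The sketch assumes a 30-day month of continuous operation at full TDP, which is what the $17.28 figure implies; actual draw depends on load.

```python
def monthly_energy_cost(extra_watts: float, dollars_per_kwh: float,
                        hours: float = 24 * 30) -> float:
    """Cost of an additional constant power draw over one month."""
    extra_kwh = extra_watts / 1000 * hours
    return extra_kwh * dollars_per_kwh

gpus_per_server = 4
tdp_delta_watts = 350 - 300  # L40S TDP minus L40 TDP

cost = monthly_energy_cost(gpus_per_server * tdp_delta_watts, 0.12)
print(f"Extra electricity per server: ${cost:.2f}/month")
```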

Cloud Pricing and Availability

The L40 PCIe is available on Spheron at $0.76/hr on-demand. The L40S is Spheron's mid-tier inference GPU, available on-demand at $0.72/hr with per-minute billing. L40S spot pricing is $0.32/hr, a 56% discount versus on-demand, suited for interruptible batch workloads.

Provider	L40 $/hr	L40S $/hr	Notes
Spheron	$0.76 on-demand	$0.72 on-demand, $0.32 spot	Per-minute billing, no minimum
Runpod	~$0.34 Community Cloud	~$0.79 Community Cloud / $0.86 Secure Cloud	Community Cloud is shared infrastructure
Vast.ai	From ~$0.22 spot	From ~$0.31 spot	Variable availability, interruptible
Lambda	Not offered	Not offered	No Ada Lovelace consumer-class GPUs
AWS	Not offered	$1.86/hr (G6e instances)	Managed service overhead included

Pricing fluctuates with GPU availability. The prices above are as of 05 May 2026 and may have changed. Check current GPU pricing for live rates.

For a broader multi-provider comparison including H100, A100, and B200, see the GPU cloud pricing comparison 2026.

Decision Matrix

Scenario	Recommendation	Reason
FP8-quantized 7B-13B LLM inference at moderate concurrency	L40S	Transformer Engine delivers 1.5-2x throughput at same bandwidth
FP16 LLM inference where FP8 is unavailable	L40S still preferred	Higher FP16 TFLOPS (362 vs 181 dense)
Video transcoding only (ffmpeg, RTMP pipelines)	Either; L40 if power-constrained	Identical NVENC hardware
Generative video AI (Wan, HunyuanVideo, FLUX video)	L40S	Diffusion backbone benefits from FP8 inference
QLoRA fine-tuning of 7B-13B models	L40S	FP8 mixed-precision options; same VRAM budget
Power-constrained rack deployment (300W limit per slot)	L40	50W lower TDP, identical NVENC, FP8 not needed
High-concurrency inference (50+ concurrent requests)	H100	Neither L40 nor L40S has HBM bandwidth for this regime; see best GPU for AI inference guide
Budget batch workloads, lower FP8 throughput acceptable	L40 on Spheron ($0.76/hr) or secondary market	Available on Spheron; also on Vast.ai/Runpod at lower spot prices

The L40S is Spheron's mid-tier inference GPU: cheaper per hour than an H100, faster on FP8 than the L40, and available on-demand at per-minute billing with no contracts. It is the right starting point for most 7B-34B inference deployments before traffic justifies H100-class hardware.

Quick Setup Guide

Identify your workload type
If your primary workload is FP8-quantized LLM inference (Llama, Qwen, Mistral, DeepSeek), start with the L40S: its Transformer Engine delivers 1.5-2x throughput over the L40. For graphics rendering or video encode pipelines where FP8 is unused, the L40's lower TDP may suit power-constrained environments. For video AI that uses FP8 diffusion models (Wan, HunyuanVideo), the L40S is the better fit.
Check FP8 compatibility for your model
Confirm your model supports FP8 inference before choosing the L40S. In vLLM, set --quantization fp8 and check that your model is in the supported list (Llama 3.x, Qwen 2.x/3.x, Mistral, DeepSeek V2/V3 all work). If FP8 is not supported, the L40S still outperforms the L40 via INT8, but the gap narrows.
Calculate cost-per-million-tokens for your batch size
Use: Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1,000,000. Run this calculation at your expected batch size using the benchmark data in this guide. Compare the L40S at live Spheron pricing against your current or target GPU choice.
Test on-demand before committing
Rent an L40S on Spheron, deploy vLLM with your model, and benchmark actual throughput at your target batch size. Per-minute billing means there is no minimum commitment: a 20-minute test at $0.72/hr costs under $0.25.
Scale based on measured throughput
Once you have a measured tokens/sec figure, divide your peak request rate by (tokens/sec / average_output_length) to get the number of GPU instances needed. For L40S at FP8, plan for burst capacity by adding 20-30% headroom.
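The sizing rule in the last step can be sketched as follows. The peak request rate, average output length, and 25% headroom are placeholder inputs chosen for illustration; the 504 tokens/sec figure is the L40S FP8 batch-8 estimate from this guide and should be replaced with your own measurement.

```python
import math

def gpus_needed(peak_requests_per_sec: float, tokens_per_sec: float,
                avg_output_tokens: float, headroom: float = 0.25) -> int:
    """Instances required to serve a peak request rate with burst headroom."""
    requests_per_sec_per_gpu = tokens_per_sec / avg_output_tokens
    base = peak_requests_per_sec / requests_per_sec_per_gpu
    return math.ceil(base * (1 + headroom))

# Placeholder workload: 10 requests/sec at peak, 256 output tokens each.
instances = gpus_needed(peak_requests_per_sec=10, tokens_per_sec=504,
                        avg_output_tokens=256)
print(f"GPU instances needed: {instances}")
```

Rounding up after applying headroom, rather than before, avoids provisioning an extra instance when the base requirement is already close to a whole number.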

Frequently Asked Questions

What is the main difference between NVIDIA L40 and L40S for AI inference?
The L40S has roughly 2x the FP8 throughput of the L40 (733 TFLOPS dense vs 362 TFLOPS dense) and ships with NVIDIA's Transformer Engine for automatic FP8/FP16 layer-by-layer recasting in vLLM. On FP8-quantized LLM inference (Llama 3, Qwen, Mistral), the L40S delivers 1.5-2x higher token throughput compared to the L40. Both GPUs support FP8 via 4th-gen Tensor Cores and share the Ada Lovelace architecture, 48 GB GDDR6 ECC memory, 864 GB/s bandwidth, and 18,176 CUDA cores. The L40S also draws 350W versus the L40's 300W, a direct trade-off for its higher compute throughput.

Does the NVIDIA L40 support FP8 quantization?
Yes. The L40 supports FP8 via its 4th-gen Tensor Cores at 362 TFLOPS dense (724 with sparsity). The L40S delivers 733 TFLOPS FP8 dense, roughly 2x, and ships with NVIDIA's Transformer Engine for automatic FP8/FP16 layer-by-layer recasting in vLLM. The L40 can run FP8 workloads, but at lower throughput and without the Transformer Engine framework that makes FP8 deployment straightforward in vLLM.

Is the L40S worth the premium over the L40 for LLM workloads?
For FP8 LLM inference, yes. The L40S's FP8 Transformer Engine delivers 1.5-2x the token throughput of the L40 on FP8-quantized models. If you're serving Llama 3.1 8B or Qwen 2.5 14B at scale, that throughput difference reduces your cost-per-million-tokens significantly. For video rendering or graphics workloads where FP8 is irrelevant, the L40 at lower wattage may be the better choice. For inference on Spheron, the L40S is the clear recommendation over the L40.

Which GPU is better for video AI workloads: L40 or L40S?
Both GPUs share the same Ada Lovelace NVENC encoder hardware, so for standard AV1 and H.265 encode workloads (ffmpeg, Handbrake, RTMP pipelines) they perform equivalently. The L40S has an edge on generative video workloads (Wan, HunyuanVideo, FLUX video) that use FP8 inference in the diffusion backbone, but for pure encode/decode pipelines the difference is negligible. The L40's 300W TDP is an advantage in power-constrained deployments.

How does the L40S compare to the H100 for FP8 inference?
The H100 SXM delivers approximately 3,958 TFLOPS FP8 (with sparsity) versus the L40S's 1,466 TFLOPS FP8 (with sparsity). In practice, Llama 3.1 8B FP8 on an H100 SXM achieves roughly 2-2.5x the throughput of the same model on an L40S. However, the H100 starts from $2.01/hr on Spheron versus the L40S from $0.72/hr. For moderate-traffic endpoints under 16 concurrent requests, the L40S cost-per-token is competitive. For high-concurrency serving at batch sizes above 16, the H100's 3.35 TB/s HBM3 bandwidth becomes the decisive factor.