The L40S starts at $0.72/hr on Spheron. An H100 SXM starts at $1.33/hr on Spheron. For inference workloads running 7B–13B parameter models, the two GPUs often deliver equivalent output despite the roughly 1.8x price gap, which makes the L40S the obvious choice.
That cost gap matters because most production LLM endpoints are not memory-bandwidth-bound at low-to-medium concurrency. The H100's 3.35 TB/s of HBM3 bandwidth only pays off at batch sizes above roughly 8–16 concurrent requests. Below that threshold, 48 GB of GDDR6 at 864 GB/s is plenty to serve Llama 3.1 8B or Mistral NeMo 12B at interactive speeds, and the L40S does it at $0.72/hr instead of $1.33/hr.
The L40S is NVIDIA's Ada Lovelace-based data center GPU, sharing the same underlying architecture as the RTX 4090 but in a full data center form factor: ECC memory, 350 W TDP, PCIe Gen4 connectivity, and 48 GB of GDDR6 with 4th-generation Tensor Cores that support FP8 via the Transformer Engine. It is not a flagship GPU for training or large-scale throughput. It is an inference specialist built for the workloads most teams actually run.
This guide covers the L40S architecture, full specifications, LLM and image generation benchmarks versus the H100 and A100, cloud pricing across providers, and a clear breakdown of which workloads belong on an L40S versus the alternatives.
Ada Lovelace Architecture
Transformer Engine and FP8 Inference
The L40S's 4th-generation Tensor Cores include NVIDIA's Transformer Engine, which dynamically selects between FP8 and FP16 precision on a per-layer basis during inference. This matters in practice because FP8 doubles the arithmetic throughput without requiring manual precision configuration: the engine handles it automatically based on per-tensor scaling factors computed at runtime.
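The per-tensor scaling mechanism can be sketched in a few lines. This is an illustrative simplification, not NVIDIA's implementation: E4M3's maximum finite value of 448 is real, but the rounding below is a crude stand-in for FP8's ~3-bit mantissa.

```python
# Sketch of per-tensor FP8 (E4M3) scaling, the mechanism the Transformer
# Engine automates. Simplified for illustration; the rounding is a crude
# stand-in for E4M3's ~3-bit mantissa, not real hardware behavior.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_quantize(tensor):
    """Scale a tensor into E4M3 range, then round to simulate FP8 storage."""
    amax = max(abs(x) for x in tensor)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    # Round to 2 significant figures as a rough proxy for the 3-bit mantissa.
    quantized = [float(f"{x * scale:.2g}") for x in tensor]
    return quantized, scale

def fp8_dequantize(quantized, scale):
    return [q / scale for q in quantized]

weights = [0.013, -2.7, 0.4405, 31.0]
q, s = fp8_quantize(weights)
restored = fp8_dequantize(q, s)
# Values survive the round trip to within the coarse mantissa precision.
for w, r in zip(weights, restored):
    assert abs(w - r) <= abs(w) * 0.05
```

The point of the per-tensor scale factor is that it stretches each tensor's dynamic range to fill FP8's narrow representable range, which is why accuracy loss stays small without any manual precision tuning.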
At TF32 precision the L40S delivers 91.6 TFLOPS dense, or 183.2 TFLOPS with structured sparsity. At FP8, dense throughput is 733 TFLOPS, and the headline 1,466 TFLOPS figure adds 2:4 structured sparsity on top of that, sixteen times the 91.6 TFLOPS FP32 baseline. Keep in mind that structured sparsity only applies to weights pruned into a 2:4 pattern; most vLLM deployments serve dense models, so the 733 TFLOPS dense FP8 figure is the realistic operating point, with 1,466 as the ceiling for explicitly sparsified models.
For practical LLM inference, this means the L40S can run FP8-quantized models with negligible quality loss compared to FP16, at roughly double the token throughput. The best NVIDIA GPUs for LLMs guide has a full breakdown of how quantization levels affect model quality across different architectures.
3rd-Generation RT Cores
The L40S includes 142 third-generation RT Cores, inherited from the Ada Lovelace architecture. These hardware units accelerate ray tracing for visualization and rendering workloads, including generative 3D pipelines and visual computing applications that combine AI inference with real-time rendering.
One question that comes up often: do RT Cores help with diffusion model inference? The answer is no. Stable Diffusion, FLUX, and similar models run their denoising network (a UNet in SDXL, a diffusion transformer in FLUX) on the Tensor Cores, not the RT Cores. RT Cores only accelerate ray-triangle intersection calculations in graphics pipelines. For image generation workloads, the relevant specs are VRAM capacity and Tensor Core throughput, not RT Core count.
Where RT Cores matter is in hybrid workloads: rendering pipelines that combine generative AI outputs with photorealistic ray-traced compositing, or real-time visualization applications that need both AI inference and rendering acceleration in the same GPU.
GDDR6 vs HBM: The Core Trade-off
The L40S uses 48 GB of GDDR6 with ECC at 864 GB/s. The H100 SXM uses 80 GB of HBM3 at 3.35 TB/s. That 3.9x bandwidth difference is the single most important spec when choosing between them.
For LLM inference, memory bandwidth determines how fast model weights stream from GPU memory into the Tensor Cores. At batch size 1 (serving one request at a time), the GPU is memory-bound: each forward pass reloads the full weight set. In this regime, 864 GB/s is sufficient for 7B–13B models at reasonable latency. At batch size 8 in FP16, the L40S achieves approximately 336 tokens/second on Llama 3.1 8B, enough for most moderate-traffic production endpoints.
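The batch-1 arithmetic is easy to sanity-check: decoding one token requires streaming every weight from VRAM, so bandwidth divided by model size gives a hard ceiling on tokens per second. A minimal sketch, ignoring KV-cache traffic and kernel overhead (which is why measured numbers land below the ceiling):

```python
# Back-of-envelope decode ceiling at batch size 1: every generated token
# streams the full weight set from VRAM, so tokens/sec is bounded by
# bandwidth / model_bytes. KV-cache reads and kernel launch overhead are
# ignored, which is why measured throughput lands below this ceiling.

def decode_ceiling_tok_s(bandwidth_gb_s, params_b, bytes_per_param):
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

l40s = decode_ceiling_tok_s(864, 8, 2)    # Llama 3.1 8B in FP16 on L40S
a100 = decode_ceiling_tok_s(2039, 8, 2)   # same model on A100 80GB

assert round(l40s) == 54   # measured throughput is ~46 tok/s, below the ceiling
assert round(a100 / l40s, 1) == 2.4
```

The theoretical bandwidth ratio (~2.4x) is slightly above the ~1.9x measured gap cited later in this guide, since real decoding adds per-token overheads the ceiling ignores.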
At batch size 16+, memory bandwidth becomes the primary bottleneck. The H100's 3.9x bandwidth advantage translates directly to proportionally higher throughput. Teams running high-concurrency inference servers with 50+ simultaneous requests should factor this into their decision.
GDDR6 has one structural advantage: cost. It is far cheaper to manufacture per gigabyte than HBM, which translates into more VRAM per dollar at every performance tier. The L40S's 48 GB at $0.72/hr delivers better VRAM-per-dollar than the H100 SXM's 80 GB of HBM3 at $1.33/hr.
Multi-GPU Limitations
The L40S connects via PCIe Gen4 x16. It does not support NVLink. For multi-GPU setups, L40S cards communicate over PCIe at approximately 32 GB/s unidirectional per slot, compared to 900 GB/s of total bidirectional bandwidth for NVLink 4 on the H100.
For tensor-parallel inference that splits a model across multiple GPUs, this PCIe bandwidth limit is a meaningful constraint. Splitting a 13B or larger model across two L40S cards over PCIe adds inter-GPU communication overhead that doesn't exist in NVLink-connected H100 pairs. In practice, for models that fit on a single L40S (up to ~34B at INT4), multi-GPU is unnecessary. For larger models, the H100 with NVLink is the better architecture.
Full Specifications
| Specification | NVIDIA L40S |
|---|---|
| Architecture | Ada Lovelace |
| Process Node | TSMC 4N |
| VRAM | 48 GB GDDR6 with ECC |
| Memory Bandwidth | 864 GB/s |
| CUDA Cores | 18,176 |
| Tensor Cores | 4th generation (568) |
| RT Cores | 3rd generation (142) |
| FP32 Performance | 91.6 TFLOPS |
| FP16 / BF16 Performance | 362 TFLOPS (dense) / 733 TFLOPS (with sparsity) |
| FP8 / INT8 Performance | 733 TFLOPS (dense) / 1,466 TFLOPS (with structured sparsity) |
| TDP | 350 W |
| PCIe Interface | Gen4 x16 |
| MIG Support | Not supported |
| NVLink | Not supported (PCIe only) |
| Multi-GPU | PCIe peer-to-peer only |
L40S vs H100 vs A100: Side-by-Side
This is the comparison that matters for most inference decisions. Check your workload against the table, then read the narrative below for the nuance.
| Specification | L40S | H100 SXM | A100 80GB |
|---|---|---|---|
| Architecture | Ada Lovelace | Hopper | Ampere |
| Process | TSMC 4N | TSMC 4N | TSMC 7nm |
| VRAM | 48 GB GDDR6 | 80 GB HBM3 | 80 GB HBM2e |
| Memory Bandwidth | 864 GB/s | 3,350 GB/s | 2,039 GB/s |
| FP32 TFLOPS | 91.6 | 67 | 19.5 |
| FP16 TFLOPS | 362 (dense) / 733 (with sparsity) | 989 (dense) / 1,979 (with sparsity) | 312 (dense) / 624 (with sparsity) |
| FP8 TFLOPS | 1,466 (with sparsity) | 3,958 (with sparsity) | N/A (INT8: 624 TOPS dense / 1,248 TOPS with sparsity) |
| TDP | 350 W | 700 W | 400 W |
| NVLink | No (PCIe only) | Yes (NVLink 4, 900 GB/s) | Yes (NVLink 3, 600 GB/s) |
| MIG Support | No | Yes (up to 7 instances) | Yes (up to 7 instances) |
| Cloud Price/hr | from $0.72 | from $1.33 | from $0.72 |
| Best for | 7B–13B inference, image gen | High-throughput serving, training | Training, budget 70B inference |
Why the L40S wins on cost-per-token for 7B–13B inference: At batch size 1–4, GDDR6 bandwidth is sufficient to serve these models at interactive speeds. The H100's 3.35 TB/s bandwidth advantage only matters when the GPU is memory-bandwidth-bound under high concurrency. For a startup running a single Llama 3.1 8B endpoint with moderate traffic, the L40S at $0.72/hr vs the H100 at $1.33/hr represents roughly 46% cost savings for equivalent output. For more on choosing between these for different model sizes, see our GPU cloud benchmarks guide.
Why the H100 wins for high-throughput production serving: At batch 16+, memory bandwidth becomes the primary throughput bottleneck, and the H100's 3.9x bandwidth advantage compounds directly. If you are running a production API with hundreds of concurrent requests, the H100 delivers significantly more tokens per dollar at that scale. The H100 also supports MIG, enabling multi-tenant GPU sharing for cloud providers and enterprise deployments, a capability the L40S lacks.
Where the A100 fits: The A100 is the only GPU in this table with both NVLink and HBM memory at under $1/hr. It is the best option for teams fine-tuning models that need NVLink-connected multi-GPU gradient synchronization, or for running 70B models at INT4 where you need the full 80 GB at HBM2e bandwidth. For pure inference, the L40S's FP8 Tensor Core support makes it more compute-efficient than the A100 at similar pricing.
Benchmark Performance
LLM Inference Throughput
The benchmarks below use vLLM as the inference framework, which is the standard for production L40S deployments. Results are measured at steady-state throughput.
| Model | Precision | Batch Size | Tokens/sec | vs A100 80GB |
|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 1 | 46 | ~0.53x (A100 80GB is ~1.9x faster) |
| Llama 3.1 8B | FP16 | 8 | 336 | ~0.52x (A100 80GB is ~1.93x faster) |
At batch size 1 in FP16, the A100 80 GB's 2,039 GB/s memory bandwidth gives it roughly 1.9x the throughput of the L40S in this memory-bound regime. Both GPUs are bandwidth-limited here, and the A100's higher memory bandwidth is the primary driver of that throughput advantage. At batch size 8 in FP16, the gap persists: the L40S delivers approximately 336 tok/s while the A100 reaches around 648 tok/s, keeping the A100 advantage at roughly 1.93x. The L40S's 4th-gen Tensor Cores do support native FP8 computation via the Transformer Engine, which the A100 lacks, and enabling FP8 in vLLM can push L40S throughput beyond the FP16 baseline. Actual FP8 gains vary by model architecture and vLLM version.
For cost comparison at batch size 8: the L40S at $0.72/hr and 336 tok/s delivers approximately 1.68 million tokens per dollar. An H100 SXM at $1.33/hr running the same model at batch size 8 delivers roughly 2.4x the throughput (around 816 tok/s) at 1.85x the hourly cost, making the H100 more token-efficient per dollar at sustained high throughput. The L40S advantage is its lower fixed hourly cost for deployments where full GPU utilization is not sustained. If your endpoint handles intermittent or low-to-moderate traffic, the L40S at $0.72/hr costs less per hour regardless of throughput ceiling.
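The tokens-per-dollar arithmetic is easy to reproduce. The throughput figures below are the batch-8 numbers quoted in this guide and the prices are the "from" rates, so treat the output as directional:

```python
# Tokens-per-dollar at sustained full utilization. Throughput figures are
# the batch-8 numbers quoted in this guide; prices are "from" rates that
# drift over time, so treat the output as directional.

def tokens_per_dollar(tok_per_s, price_per_hr):
    return tok_per_s * 3600 / price_per_hr

l40s = tokens_per_dollar(336, 0.72)   # L40S, Llama 3.1 8B, batch 8
h100 = tokens_per_dollar(816, 1.33)   # H100 SXM, same workload

assert round(l40s) == 1_680_000       # ~1.68M tokens per dollar
assert round(h100 / l40s, 2) == 1.31  # H100 ~31% more token-efficient when saturated
```

The crossover logic follows directly: the H100 wins per dollar only when it stays saturated; at partial utilization the L40S's lower fixed hourly rate dominates.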
For deeper cost-per-token analysis across GPU types, the GPU cloud benchmarks guide has cost-per-million-token calculations across the full GPU lineup.
Image Generation Performance
The L40S is a capable image generation GPU, sharing the same Ada Lovelace architecture as the RTX 4090 with slightly more raw compute (91.6 vs 82.6 TFLOPS FP32 = about 11% more) and slightly less memory bandwidth (864 vs 1,008 GB/s for the RTX 4090's GDDR6X). For single-image SDXL generation, the two GPUs deliver comparable throughput. The L40S advantage is VRAM: 48 GB vs 24 GB.
At 1024x1024 with 30 denoising steps using SDXL, the RTX 4090 takes approximately 6 seconds per image in ComfyUI. The L40S performs comparably at single-image workloads, with a slight compute advantage at larger resolutions or with more diffusion steps due to its marginally higher FP32 TFLOPS.
Where the L40S pulls clearly ahead is batch generation. With 48 GB of VRAM, the L40S can hold multiple SDXL checkpoints in memory simultaneously, run batch sizes of 8–16 images without VRAM overflow, and load ControlNet adapters or LoRA weights on top of the base SDXL model. The RTX 4090 hits VRAM limits at batch size 4–6 for 1024x1024 generation, requiring model offloading that adds latency.
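A rough way to see why the VRAM gap translates to batch capacity. The base-model footprint, per-image activation figure, and overhead below are assumptions chosen to roughly match the batch limits described above, not measured values:

```python
# Crude SDXL batch-capacity model. The component sizes (base model ~7 GB
# in FP16, ~2.5 GB activation headroom per concurrent 1024x1024 image,
# ~3 GB runtime overhead) are assumptions for illustration, not
# measurements; real limits depend on resolution, attention backend,
# and loaded adapters.

def max_sdxl_batch(vram_gb, base_model_gb=7.0, per_image_gb=2.5,
                   runtime_overhead_gb=3.0):
    free = vram_gb - base_model_gb - runtime_overhead_gb
    return int(free // per_image_gb)

l40s_batch = max_sdxl_batch(48)      # lands in the batch 8-16+ range
rtx4090_batch = max_sdxl_batch(24)   # lands in the batch 4-6 range

assert l40s_batch >= 15
assert rtx4090_batch <= 6
```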
For LoRA fine-tuning of diffusion models, the L40S's 48 GB similarly enables larger batch sizes and higher-resolution training than the 24 GB RTX 4090. Teams fine-tuning SDXL or FLUX LoRAs will see fewer OOM errors and faster training iterations.
Cloud Pricing and Availability
| Provider | L40S Price/hr | Notes |
|---|---|---|
| Spheron | from $0.72 | Per-minute billing, no commitment |
| Vast.ai | from $0.31 | Spot marketplace, per-second billing, variable availability |
| RunPod | from $0.79 | Community Cloud; $0.86/hr Secure Cloud on-demand |
| Modal | $1.95 | Serverless, per-second billing |
| DigitalOcean | $1.57 | On-demand |
| AWS | $1.86 | On-demand (G6e instances) |
| CoreWeave | $2.25 | On-demand |
Note: Google Cloud and Azure do not currently offer NVIDIA L40S GPUs. Among major hyperscalers, AWS (via single-GPU G6e instances) and Oracle Cloud Infrastructure (OCI, via bare-metal BM.GPU.L40S.4 instances requiring a minimum of 4 GPUs) both offer L40S access.
The spread between Spheron's $0.72/hr and enterprise cloud rates is significant. AWS charges $1.86/hr for the same hardware, while platforms like DigitalOcean and CoreWeave run $1.57 to $2.25/hr. For teams running sustained inference workloads, that gap compounds. At 720 hours per month, the difference between $0.72/hr and $1.86/hr (AWS) is over $820/month per GPU.
Spheron's per-minute billing also matters for workloads with variable demand: you don't pay for idle time between jobs. For batch inference pipelines that run overnight or on-demand triggers, this is meaningful compared to reserved or hourly-minimum billing. Check current rates at Spheron's L40S rental page and see all options at Spheron pricing.
GPU pricing fluctuates over time based on availability. Spheron pricing above reflects rates as of March 15, 2026.
Ideal Workloads for the L40S
7B–13B LLM Inference (Production Endpoints)
A single L40S running Llama 3.1 8B via vLLM can serve a production chatbot or API endpoint at ~46 tok/s at batch size 1 in FP16 and approximately 336 tok/s at batch size 8 in FP16. At $0.72/hr, this is approximately 1.68M tokens per dollar. For a product serving 100–500 concurrent users with a 7B model, one or two L40S instances handle the load at a fraction of H100 pricing.
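For concreteness, one plausible launch line for this setup. This is a sketch, not a tuned production config: the flags shown are standard vLLM options, but the exact values should be adjusted to your traffic pattern and vLLM version.

```shell
# One plausible vLLM launch for a single L40S serving Llama 3.1 8B.
# --max-num-seqs caps concurrent sequences near the batch-8 regime
# discussed above; tune it to your actual concurrency.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Swapping `--dtype float16` for `--quantization fp8` is the lever for the FP8 throughput gains discussed earlier, at the cost of a small quantization step on load.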
The L40S's 48 GB is sufficient for Mistral NeMo 12B at FP16 (24 GB weights + KV cache), Mistral 7B at FP16 with generous context, and Mixtral 8x7B at INT4 (approximately 24 GB). For detailed VRAM requirements across popular models, see the GPU requirements guide for open-source AI models.
The limit is concurrency: at batch size 32+, the H100's bandwidth advantage becomes the decisive factor. For startups not operating at that scale, the L40S is the economically correct choice.
AI Image Generation (Stable Diffusion, FLUX)
The L40S handles Stable Diffusion XL and FLUX.1 workloads well, with two distinct advantages over the RTX 4090: double the VRAM (48 GB vs 24 GB) and data center reliability (ECC memory, PCIe Gen4, no EULA restrictions on commercial use).
The extra VRAM enables batch generation at 8–16 images simultaneously without swapping to CPU memory, a common bottleneck for RTX 4090 deployments running production image generation APIs. It also allows loading multiple LoRA adapters or ControlNet models alongside the base SDXL checkpoint without eviction.
For teams running commercial image generation APIs, the L40S's data center form factor matters. The RTX 4090's GeForce EULA technically prohibits datacenter use; the L40S has no such restriction. See our RTX 4090 for AI workloads guide for a comparison of consumer vs data center GPUs for this use case.
Multimodal AI (Vision + Language)
The L40S's 48 GB makes it a natural fit for vision-language models like LLaVA, Qwen-VL, and InternVL. A 13B VLM with a CLIP-based visual encoder loads the visual encoder (approximately 4–6 GB) plus the language model (approximately 26 GB at FP16) into 30–32 GB total, comfortably within the L40S's 48 GB budget, with room for KV cache.
HBM-based GPUs are not required for most multimodal inference. The compute and bandwidth demands are similar to standard LLM inference: memory-bandwidth-limited at low batch sizes, compute-bound at higher concurrency. The L40S handles both regimes within its price tier, which is where most multimodal API deployments operate.
Fine-Tuning Small-to-Medium Models with QLoRA
The L40S's 48 GB VRAM enables QLoRA fine-tuning of 13B models on a single GPU: the 4-bit quantized base model occupies approximately 6.5 GB, with LoRA adapter weights, gradients, and optimizer states adding roughly 4–8 GB, for a total of 12–16 GB. For LoRA fine-tuning with a full-precision BF16 base, the 13B model requires approximately 26 GB for weights plus 4–8 GB for adapters and optimizer states, which also fits within the 48 GB budget. Full fine-tuning of 7B models in BF16 with standard AdamW requires approximately 60–70 GB (model weights, gradients, and FP32 optimizer states), which exceeds a single L40S. Using 8-bit AdamW and gradient checkpointing reduces this to approximately 35–42 GB, making full fine-tuning of 7B feasible on the L40S.
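The memory arithmetic above can be reproduced with a small estimator. The adapter overhead and bytes-per-parameter figures are the rough numbers quoted in the text, not measured values:

```python
# Reproducing the fine-tuning memory arithmetic above. The 6 GB adapter/
# optimizer overhead and the ~10 bytes/param full-fine-tune accounting are
# the rough figures quoted in the text, not measured values.

def qlora_vram_gb(params_b, adapter_overhead_gb=6.0):
    base_4bit = params_b * 0.5              # 4-bit quantized base weights
    return base_4bit + adapter_overhead_gb  # + LoRA adapters, grads, optimizer

def full_ft_vram_gb(params_b, bytes_per_param_state=10.0):
    # BF16 weights + gradients plus FP32 AdamW states; ~10 bytes/param is
    # a rough midpoint of common accounting schemes.
    return params_b * bytes_per_param_state

assert qlora_vram_gb(13) == 12.5    # 13B QLoRA fits easily in 48 GB
assert full_ft_vram_gb(7) == 70.0   # 7B full FT exceeds one L40S without
                                    # 8-bit optimizers / grad checkpointing
```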
For teams that need to fine-tune frequently on domain-specific data and serve infrequently, the L40S at $0.72/hr delivers better economics than renting an H100 for the same fine-tuning job. A 13B QLoRA run that takes 4 hours costs $2.88 on an L40S vs approximately $5.32 on an H100.
The GDDR6 bandwidth limitation (864 GB/s vs 2–3.35 TB/s for HBM GPUs) affects gradient synchronization throughput for full fine-tuning at large batch sizes. For multi-GPU fine-tuning with tensor parallelism, the PCIe-only multi-GPU interconnect adds overhead not present with NVLink. For these scenarios (multi-GPU full fine-tuning of large models), the H100 with NVLink is the right choice. For everything else in the 7B–13B range with QLoRA, the L40S is sufficient. See our LLM fine-tuning guide for detailed memory and cost planning.
Whether you need the L40S for cost-efficient LLM inference, AI image generation, or multimodal workloads, Spheron provides bare-metal access starting at $0.72/hr with per-minute billing and no long-term contracts.
