NVIDIA L40S GPU: 48GB Specs, Pricing & Rental. Rent L40S GPU from $0.67/hr
48GB GDDR6 ECC Ada Lovelace data center GPU with FP8 Tensor Cores. L40S GPU rentals tuned for inference, video, and visual AI.
You can rent an NVIDIA L40S on Spheron starting at $0.67 per GPU per hour, the lowest live marketplace rate. Billing is per minute with no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each card ships with 48GB of GDDR6 ECC memory, 4th generation Tensor Cores with FP8 support, 3rd generation RT Cores, and hardware AV1 encode. The L40S is purpose-built for production inference of 7B-30B LLMs, Stable Diffusion and SDXL serving, video transcoding pipelines, and mixed AI + graphics workloads where you need data center reliability without H100 pricing.
NVIDIA L40S specifications
NVIDIA L40S pricing
| Provider | Price/hr | Savings |
|---|---|---|
| Spheron (your price) | $0.67/hr | - |
| RunPod | $0.79/hr | 1.2x more expensive |
| Lambda Labs | $1.29/hr | 1.9x more expensive |
| CoreWeave | $1.89/hr | 2.8x more expensive |
| AWS (g6e.xlarge) | $1.86/hr | 2.8x more expensive |
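Each multiple in the last column is that provider's hourly rate divided by the Spheron rate, rounded to one decimal. A minimal Python sketch that reproduces them from the prices listed above; the dictionary and function names are illustrative, and the prices are this page's snapshot, not live quotes:

```python
# Hourly L40S rates copied from the table above (a snapshot, not live pricing).
PRICES_PER_HOUR = {
    "Spheron": 0.67,
    "RunPod": 0.79,
    "Lambda Labs": 1.29,
    "CoreWeave": 1.89,
    "AWS (g6e.xlarge)": 1.86,
}


def cost_multiple(provider: str, baseline: str = "Spheron") -> float:
    """Return how many times more expensive `provider` is than `baseline`."""
    return round(PRICES_PER_HOUR[provider] / PRICES_PER_HOUR[baseline], 1)


for name in PRICES_PER_HOUR:
    print(f"{name}: {cost_multiple(name)}x")
```

Multiply the per-hour difference by your expected GPU-hours per month to turn the multiples into a budget figure.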
Need More L40S Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more L40S capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the L40S
Pick L40S if
You're running production inference for 7B-30B LLMs, SDXL serving, or video transcoding pipelines and need ECC + data center drivers without H100 pricing. Also the pick when you need FP8 support but don't need HBM bandwidth, and when AV1 hardware encode is on the requirements list.
Pick A100 80GB instead if
Your workload is training-heavy and bandwidth-bound. A100 has 2 TB/s HBM2e (vs 864 GB/s GDDR6 on L40S), making it faster for pre-training and fine-tuning. L40S wins at inference, A100 wins at training.
Pick RTX 4090 instead if
Your model fits in 24GB and you're running dev / testing workloads where ECC and multi-tenant isolation don't matter. RTX 4090 is roughly half the hourly rate of L40S.
Pick H100 instead if
You need HBM3 bandwidth (3.35 TB/s) or NVLink for multi-GPU tensor parallelism. H100 is the right pick for 70B+ inference or any training job where memory bandwidth is the bottleneck.
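The four rules above can be written down as a small decision function. This is only a sketch of the guidance on this page: the `Workload` fields, the rule ordering, and the fallback for models that do not fit in 48GB are assumptions made for illustration, not a sizing tool.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    model_vram_gb: float          # memory needed for weights plus KV cache
    params_billions: float        # model size in billions of parameters
    training_heavy: bool = False  # pre-training or fine-tuning, bandwidth-bound
    needs_nvlink: bool = False    # multi-GPU tensor parallelism
    production: bool = True       # needs ECC and data center drivers


def pick_gpu(w: Workload) -> str:
    # H100: 70B+ models, or anything that needs NVLink.
    if w.needs_nvlink or w.params_billions >= 70:
        return "H100"
    # A100 80GB: training-heavy, bandwidth-bound jobs.
    if w.training_heavy:
        return "A100 80GB"
    # RTX 4090: fits in 24GB and is only a dev / testing workload.
    if w.model_vram_gb <= 24 and not w.production:
        return "RTX 4090"
    # L40S: production inference that fits in 48GB.
    if w.model_vram_gb <= 48:
        return "L40S"
    # Assumption: anything larger steps up to the H100.
    return "H100"


print(pick_gpu(Workload(model_vram_gb=10, params_billions=8)))
```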
NVIDIA L40S use cases
AI Inference at Scale
Run cost-effective inference workloads with 48GB memory and INT8 support for high-throughput production deployments.
Video Processing & Encoding
Use hardware-accelerated video pipelines for live streaming, transcoding, and video analytics at scale.
Visual Computing & Rendering
Combine AI acceleration with professional graphics capabilities for rendering and visualization workloads.
Mixed AI + Graphics Workloads
Take advantage of the L40S's unique combination of AI and graphics acceleration for next-generation creative and visual AI applications.
NVIDIA L40S benchmarks
Serve Llama 3.1 8B at FP8 on L40S
L40S's 48GB GDDR6 ECC and FP8 Tensor Cores make it a strong fit for production 7B-13B inference with heavy concurrency. vLLM gives you an OpenAI-compatible endpoint in one command.
```bash
# SSH into your L40S instance
ssh root@<instance-ip>

# Install vLLM
pip install vllm

# Launch Llama 3.1 8B FP8 with high concurrency
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.9

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Hello","max_tokens":50}'
```

For 30B-class models (Qwen 2.5 32B at FP8, Mixtral 8x7B at AWQ), the weights still fit with room for KV cache at moderate batch sizes.
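If you would rather call the endpoint from application code than from curl, the same request can be sent with the Python standard library. A minimal sketch, assuming the `vllm serve` process above is listening on `localhost:8000`; `build_request` and `complete` are illustrative helper names, not part of vLLM:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM on its default port
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST /v1/completions request as the curl test."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 50) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(prompt, max_tokens), timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

With the server running, `print(complete("Hello"))` prints the generated continuation.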
NVIDIA L40S guides and resources
GPU Cloud Benchmarks 2026
See how L40S performs against A100 and RTX 4090 in real-world benchmarks across GPU cloud providers.
Best NVIDIA GPUs for LLMs: Complete Ranking Guide
Where L40S fits in the GPU lineup for LLM inference, and when it's the right budget choice.
The GPU Cloud Cost Optimization Playbook
How to cut your AI compute bill by 60%, including when to pick L40S over pricier alternatives.
NVIDIA L40S Release Date and Cloud Availability
The NVIDIA L40S was announced at SIGGRAPH in August 2023 as the data-center inference and visualization sibling of the workstation RTX 6000 Ada Generation, both built on the Ada Lovelace architecture. Production shipments began in Q4 2023, with cloud availability rolling out through H1 2024. RunPod, Lambda Labs, CoreWeave, and the broader neo-cloud ecosystem had L40S capacity by mid-2024.
On Spheron the L40S is available with per-minute billing and no contract, deployed via data center partners. Live availability and pricing are on the pricing page. The L40S is the cost-efficient inference option for 7B-30B parameter LLM serving and Stable Diffusion XL pipelines; for larger models or distributed training, the H100, H200, or B200 is the step up.
L40S VRAM and Memory Bandwidth: 48GB GDDR6 ECC at 864 GB/s
The L40S ships with 48GB of GDDR6 ECC memory at 864 GB/s of bandwidth. That bandwidth is roughly 3.9x lower than the H100 SXM5 (3.35 TB/s HBM3), and the 48GB of VRAM is 60% of the H100's 80GB. For 7B-13B model inference at low to moderate concurrency, the bandwidth gap to the H100 matters less than the price-per-hour gap, making the L40S the cost-efficient pick.
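To put the bandwidth figures in context, a common back-of-the-envelope bound treats single-sequence decoding as memory-bound: every generated token reads all the weights once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that bound; it ignores KV-cache reads, compute limits, and batching (which shares one weight read across many sequences), and the 8 GB weight size for an 8B model at FP8 is an assumption of 1 byte per parameter.

```python
L40S_BW_GBPS = 864   # GDDR6, from the text above
H100_BW_GBPS = 3350  # HBM3 on H100 SXM5, from the text above


def bandwidth_ratio() -> float:
    """How many times more memory bandwidth the H100 SXM5 has than the L40S."""
    return round(H100_BW_GBPS / L40S_BW_GBPS, 1)


def decode_ceiling_tok_s(bandwidth_gbps: float, weight_gb: float) -> float:
    """Rough upper bound on single-sequence decode speed (tokens/s)."""
    return bandwidth_gbps / weight_gb


# Assumption: an 8B model at FP8 is about 8 GB of weights.
WEIGHTS_GB = 8.0
print("bandwidth ratio:", bandwidth_ratio())
print("L40S ceiling:", decode_ceiling_tok_s(L40S_BW_GBPS, WEIGHTS_GB), "tok/s")
print("H100 ceiling:", decode_ceiling_tok_s(H100_BW_GBPS, WEIGHTS_GB), "tok/s")
```

These are ceilings rather than benchmarks; measured throughput depends on the serving stack, batch size, and context length.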
Where the 48GB VRAM matters: Llama 3.1 8B fits in FP16 with substantial KV cache headroom for high-concurrency serving, a 13B model fits in FP16 with smaller batches, a 30B-class model fits in INT4 quantization, and Stable Diffusion XL with multiple LoRA adapters and ControlNets runs without OOM. ECC support adds reliability for production serving that consumer GPUs lack. For 70B model inference or anything requiring NVLink for multi-GPU tensor parallelism, step up to the A100 80GB or H100 SXM5. For higher-volume single-card inference where FP4 is acceptable, the B200 spot tier is competitive.
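The fit claims above follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter, and whatever is left of the 48GB goes to KV cache and activations. A minimal sketch of that estimate; the function names are illustrative, and it ignores quantization metadata, framework overhead, and the CUDA context, so real headroom will be somewhat lower.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
L40S_VRAM_GB = 48


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def headroom_gb(params_billions: float, precision: str,
                vram_gb: float = L40S_VRAM_GB) -> float:
    """VRAM left after loading weights; negative means the weights do not fit."""
    return vram_gb - weight_gb(params_billions, precision)


for params, precision in [(8, "fp16"), (13, "fp16"), (30, "int4"), (70, "fp16")]:
    print(f"{params}B {precision}: {headroom_gb(params, precision):+.0f} GB headroom")
```

The negative result for a 70B model in FP16 is why that case needs a larger card or multiple GPUs.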