Best Open-Source LLMs to Self-Host in 2026: VRAM Tier Guide


You have a GPU. You want to run an open-source LLM on it. The only real question is which model fits and performs well at your VRAM tier. This guide answers that directly, tier by tier, with actual memory math so you can stop guessing. For the underlying VRAM formula and why context length matters so much, see the GPU memory requirements for LLMs writeup. For a single-page lookup table organized by model family, the GPU requirements cheat sheet is faster.


How VRAM Tiers Work: Weights, KV Cache, and Quantization

The basic formula for weight memory is: weights_GB = params_billions × bytes_per_param. The bytes per parameter changes with precision:

  • FP16 / BF16: 2 bytes per param
  • FP8: 1 byte per param
  • INT8: 1 byte per param
  • INT4 / AWQ / NVFP4: 0.5 bytes per param

So a 70B model in FP16 needs 70 × 2 = 140 GB for weights alone. At AWQ INT4, the same model needs 70 × 0.5 = 35 GB. That is the weight floor before you add anything else.
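
The weight formula above can be sketched as a small helper. This is a minimal illustration, not part of any library: the function name `weight_memory_gb` and the `BYTES_PER_PARAM` table are made up for this example, and the result is the weight floor only, before KV cache or framework overhead.

```python
# Bytes per parameter for the precisions listed above.
# (Names here are illustrative, not taken from any framework.)
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,  # also covers AWQ INT4 and NVFP4 in this rough model
}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight floor in GB: params_billions x bytes_per_param."""
    return params_billions * BYTES_PER_PARAM[precision.lower()]


print(weight_memory_gb(70, "fp16"))  # 140.0
print(weight_memory_gb(70, "int4"))  # 35.0
```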

KV cache is the variable that catches people off guard. It scales linearly with context length, batch size, number of layers, and hidden dimension. A rough rule for Llama-style architectures: at 4K context with batch 1, KV cache adds 2-4 GB. At 32K context with batch 4, it can add 20-40 GB. For MoE models like DeepSeek or GLM-5.2, the KV cache is based on the transformer backbone, not the expert count, so it can be smaller relative to total params than you expect.
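
For a more precise number than the rule of thumb, the usual per-token accounting for a standard transformer is two tensors (key and value) per layer, each of size KV heads × head dimension. The sketch below assumes that layout and an FP16 cache. The two model configurations are hypothetical examples rather than specs for any model named in this guide, so take the real layer and head counts from your model's config.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int,
                 bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in GiB for a standard transformer.

    The leading 2 accounts for one key and one value tensor per layer.
    bytes_per_elem=2 assumes an FP16/BF16 cache.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * bytes_per_elem * context_len * batch_size)
    return total_bytes / 1024**3


# Hypothetical 32-layer model with full multi-head attention (32 KV heads).
print(kv_cache_gib(32, 32, 128, context_len=4096, batch_size=1))  # 2.0

# Same shape with grouped-query attention (8 KV heads): 4x smaller cache.
print(kv_cache_gib(32, 8, 128, context_len=4096, batch_size=1))   # 0.5
```

Every term multiplies the others, so doubling context length or batch size doubles the cache, while architectures with fewer KV heads land well below the rule-of-thumb range.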

Framework overhead from vLLM, SGLang, or TGI adds 0.5-2 GB on top. Budget 1 GB as a safe minimum.

For GGUF format (llama.cpp), the math is the same but quantization levels map to Q4_K_M (~0.55 bytes/param), Q5_K_M (~0.67 bytes/param), Q8_0 (~1 byte/param). GGUF is good for CPU offload and local dev; for serving throughput on CUDA hardware, vLLM with AWQ or FP8 is faster. The AWQ quantization guide goes deeper on INT4 production deployment.

One common mistake with MoE models: do not use the active parameter count for VRAM sizing. Use total params. DeepSeek V4-Flash is 284B total with ~13B active per forward pass. You still need to load all parameters into VRAM. The active-param count only tells you how much compute runs per token, not how much memory you need.
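
To see how far off the active-parameter shortcut is, here is a quick comparison using the DeepSeek V4-Flash figures quoted above (284B total, ~13B active). The helper name is illustrative only.

```python
TOTAL_PARAMS_B = 284   # total parameters in billions (figure quoted above)
ACTIVE_PARAMS_B = 13   # approximate active parameters per forward pass


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB for a given parameter count and precision."""
    return params_billions * bytes_per_param


for label, bytes_per_param in [("FP8", 1.0), ("INT4", 0.5)]:
    correct = weights_gb(TOTAL_PARAMS_B, bytes_per_param)    # what must be loaded
    mistaken = weights_gb(ACTIVE_PARAMS_B, bytes_per_param)  # the sizing mistake
    print(f"{label}: plan for {correct:.0f} GB, not {mistaken:.1f} GB")
```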


12-16GB Tier: RTX 4090, Budget Cloud GPUs

A 12-16 GB budget sounds limiting, but it covers a lot of useful ground in 2026. The key is knowing when to quantize and when to just use a smaller model natively.

Models that fit cleanly at this tier:

Qwen 3 8B at FP16 needs about 16 GB for weights (8 × 2). With a 1 GB framework buffer, you are right at the limit of a 16 GB card with almost no KV cache headroom. Run it at FP8 instead (~8 GB weights) and you get 6-8 GB of room for KV cache at 8K context. At this tier, Qwen 3 8B is the strongest coding model: it outperforms most same-size alternatives on HumanEval and SWE-bench Verified.

Gemma 4 12B in FP16 needs ~24 GB, which exceeds a 16 GB card. At FP8 quantization, you land around 12 GB. The Gemma 4 family uses Google's interleaved global/local attention that benefits from larger context, so KV cache headroom matters more here than with Qwen 3.

Phi-4-mini 3.8B in FP16 needs ~8 GB and fits easily. More interesting is Phi-4 14B at AWQ INT4: ~9 GB for weights, which leaves 2-3 GB for KV cache on a 12 GB card and 6-7 GB on a 16 GB card. Phi-4 14B INT4 outperforms most 7-8B models on reasoning tasks and is a solid pick for single-turn coding or instruction following. Phi-4 and Phi-5 share the same vLLM and AWQ setup workflow, so the Deploy Phi-5 on GPU Cloud guide applies directly to Phi-4 14B as well.

Ministral 3 8B in BF16 needs ~16 GB, right at the limit. AWQ INT4 drops it to 5-6 GB with plenty of headroom. The 8B variant includes vision capabilities across precision levels. Full deployment steps are at Deploy Ministral 3 on GPU Cloud.

LFM2 is worth noting at this tier. LFM2-2.6B at BF16 runs in just 5-6 GB, which fits on nearly anything. LFM2-8B-A1B is trickier: the hybrid SSM/attention architecture means that even with 1B active params per forward pass, total model weights sit around 18 GB in BF16. At FP8 or INT4 quantization, it can fit in 10-11 GB, but KV cache behavior differs from pure-transformer models. Check the Deploy LFM2 on GPU Cloud post for architecture-specific sizing notes.

GLM-4-9B at FP8 fits in ~9 GB. It is the compact entry in Z.ai's model family and a decent coding alternative at this tier.
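
The fit checks in this section all follow the same subtraction: card VRAM minus weights minus framework overhead. A minimal sketch, assuming the 1 GB overhead budget suggested earlier (the function name is hypothetical):

```python
def kv_headroom_gb(card_gb: float, weights_gb: float,
                   overhead_gb: float = 1.0) -> float:
    """VRAM left for KV cache after weights and framework overhead.

    A negative result means the configuration does not fit at all.
    """
    return card_gb - weights_gb - overhead_gb


# Qwen 3 8B on a 16 GB card, using the weight figures from this section.
print(kv_headroom_gb(16, 16))  # -1.0 -> FP16 leaves no room once overhead is counted
print(kv_headroom_gb(16, 8))   # 7.0  -> FP8 leaves room for KV cache
```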

Model selection table for the 12-16GB tier:

| Model | Quantization | VRAM (weights) | Best Use Case |
|---|---|---|---|
| Qwen 3 8B | FP8 | ~8 GB | Coding, chat, general purpose |
| Qwen 3 8B | FP16 | ~16 GB | Higher fidelity, tight fit |
| Gemma 4 12B | FP8 | ~12 GB | Multimodal, long context |
| Phi-4-mini 3.8B | FP16 | ~8 GB | Fast inference, minimal footprint |
| Phi-4 14B | AWQ INT4 | ~9 GB | Reasoning above 8B quality |
| Ministral 3 8B | AWQ INT4 | ~5-6 GB | Vision + instruction following |
| LFM2-2.6B | BF16 | ~5-6 GB | Low VRAM footprint |
| GLM-4-9B | FP8 | ~9 GB | Coding alternative |

GPU note on the RTX 4090: NVIDIA's GeForce EULA technically prohibits using GeForce GPUs in data centers for production serving. For personal dev boxes or research this does not matter, but if you are running a production inference endpoint, the L40S is the enterprise-certified alternative with 48 GB VRAM and full data center compliance. RTX 4090 on Spheron starts at $0.77/hr for the cases where EULA is not a concern.

Pricing fluctuates based on GPU availability. Prices in this guide are as of 23 Jun 2026 and may have changed. Check current GPU pricing for live rates.


24GB Tier: RTX 5090, Single A100 PCIe

At 24 GB you get meaningful headroom over the 12-16 GB tier. The RTX 5090 (32 GB GDDR7) extends this further, and a single A100 PCIe (40 GB or 80 GB) opens different trade-offs.

Models that fit at 24-32GB:

Qwen 3 32B at AWQ INT4 needs ~16 GB for weights. On a 24 GB card this gives you 8 GB of KV cache headroom, which translates to roughly 8K-16K context at batch 2-4. This is the top coding and reasoning pick at this tier. For most engineering tasks, Qwen 3 32B INT4 on a 24 GB card outperforms 7B FP16 models by a significant margin.

Phi-4 14B at BF16 needs ~28 GB. That exceeds a 24 GB card but fits on the RTX 5090 (32 GB) with a few GB to spare for KV cache. On a standard 24 GB card, use AWQ INT4 (~9 GB) instead and spend the saved VRAM on context.

Gemma 4 31B Dense at INT4 lands at ~16-18 GB. Solid multimodal pick if you need image understanding at this tier with room for KV cache.

Ministral 3 8B in BF16 (~16 GB) fits comfortably on a 24 GB card with 8 GB left for KV cache. For shorter context workloads with vision, this works well.

LFM2-8B-A1B at BF16 plus KV cache overhead lands at around 18-19 GB. The SSM state memory is largely constant rather than growing with context like a standard KV cache, which makes LFM2 particularly efficient at long sequences. It fits on 24 GB with room for extended context.

Qwen 3.5 9B at FP16 needs ~18 GB, leaving 6 GB for KV cache on a 24 GB card. The Qwen 3.5 family made notable improvements on math and multilingual tasks over the original Qwen 3 series. Full deployment walkthrough at Deploy Qwen 3.5 on GPU Cloud.

Qwen 3.5 27B at FP8 needs ~27 GB. This just misses a 24 GB card but runs comfortably on the RTX 5090's 32 GB, with 5 GB of headroom for KV cache.

Model selection table for the 24-32GB tier:

| Model | Format | VRAM (weights) | Min GPU |
|---|---|---|---|
| Qwen 3 32B | AWQ INT4 | ~16 GB | RTX 4090 / RTX 5090 |
| Gemma 4 31B | INT4 | ~16-18 GB | RTX 4090 / RTX 5090 |
| Ministral 3 8B | BF16 | ~16 GB | RTX 4090 |
| Qwen 3.5 9B | FP16 | ~18 GB | RTX 4090 / RTX 5090 |
| LFM2-8B-A1B | BF16 | ~18-19 GB | RTX 4090 / RTX 5090 |
| Phi-4 14B | BF16 | ~28 GB | RTX 5090 (32 GB) |
| Qwen 3.5 27B | FP8 | ~27 GB | RTX 5090 (32 GB) |

RTX 5090 instances on Spheron are available when in stock; check current GPU pricing for the latest rate. For 32B INT4 workloads especially, the extra 8 GB over a 24 GB card makes a meaningful difference in practical context length.



48GB Tier: L40S, RTX PRO 6000

48 GB is where 70B-class models become practical at INT4. You can also run 30B models at FP8 with real context headroom. This tier is the sweet spot for production single-GPU inference on mid-to-large open-source models.

Models that unlock at 48GB:

Qwen2.5 72B at AWQ INT4 needs ~36-40 GB for weights. On an L40S (48 GB) this leaves 8-12 GB for KV cache, enough for 32K-48K context at batch 2. This is one of the most cost-effective ways to serve a frontier-tier open-source model in single-GPU production.

Llama 3.3 70B at INT4 follows the same math: ~35-40 GB for weights. Strong multilingual and instruction-following performance. If you need Meta's licensing terms or prioritize multilinguality, this is the pick at this quantization level.

Phi-4 14B in BF16 (~28 GB) fits with abundant KV cache headroom on a 48 GB card. You can push to 64K+ context at batch 4 without hitting VRAM limits.

Gemma 4 31B Dense at FP8 needs ~31 GB, leaving 17 GB for KV cache on the L40S. At BF16 (~62 GB), it exceeds the L40S capacity but fits on the RTX PRO 6000 Blackwell (96 GB GDDR7 ECC). The RTX PRO 6000 also covers most of the 80GB+ tier use cases described in the next section due to its larger VRAM.

Qwen 3.5 35B-A3B MoE at FP8 loads roughly 35 GB total model weights. Despite only ~3B active params per forward pass, you still need all expert weights resident in VRAM.

At 48 GB, you can serve 70B INT4 models at 32K-64K context lengths without aggressive KV cache eviction. That context range opens up retrieval-augmented generation, longer code context, and document QA workloads that are hard to serve cleanly on smaller cards.

The L40S is an Ada Lovelace-based data center GPU, enterprise certified with no EULA issues. See the NVIDIA L40S inference guide for throughput benchmarks and optimal vLLM config.

The RTX PRO 6000 Blackwell at 96 GB GDDR7 ECC sits above the L40S in this tier and extends into 80GB+ territory. The RTX PRO 6000 rental guide covers its architecture differences and production use cases in detail.

Pricing at this tier:

| GPU | VRAM | On-Demand $/hr |
|---|---|---|
| L40S | 48 GB GDDR6 | $0.96 |
| RTX PRO 6000 | 96 GB GDDR7 ECC | $2.35 |

On Spheron, the L40S rents at $0.96/hr on-demand and the RTX PRO 6000 at $2.35/hr on-demand.



80GB+ Tier: H100, H200, B200 - Full Models and Long Context

Once you clear 80 GB, the conversation shifts from fitting the model to optimizing context length and throughput.

H100 SXM5 (80 GB HBM3):

Qwen2.5 72B at FP8 needs ~72 GB for weights, leaving 8 GB for KV cache. At batch 1 with 8K context, this works. Push to batch 4 at 32K and you will start hitting limits. For production serving with real concurrency, the H100 works best with Qwen2.5 72B at INT4 (~36-40 GB), leaving 40+ GB for KV cache. Llama 4 Scout at 109B total params in INT4 lands around 55 GB and fits on a single H100.

The H100 with its 3.35 TB/s HBM3 bandwidth is substantially faster for autoregressive decode than any GDDR-based GPU. Bandwidth matters more than raw compute for decode-dominated workloads. For form factor differences, the H100 NVL vs SXM5 vs PCIe guide breaks down the trade-offs.

H200 SXM5 (141 GB HBM3e):

141 GB is not enough for Qwen2.5 72B in FP16: the weights alone total 144 GB, which already exceeds H200's capacity before any KV cache or framework overhead. For FP16 72B serving, 2x H100 is the minimum. H200 handles 70B at INT4 (~35 GB weights) with 100+ GB of KV cache headroom, or at FP8 (~70 GB weights) with approximately 70 GB of KV cache headroom. The H200's 4.8 TB/s memory bandwidth is 43% above the H100, which shows up directly in decode throughput on memory-bound autoregressive generation. For a head-to-head VRAM and throughput comparison with AMD's flagship, see the AMD MI300X vs H200 comparison.

128K context with batch 4+ at 70B FP8 is realistic on a single H200. That is the actual production capability that makes H200 interesting for long-document and agentic workloads.

B200 SXM6 (192 GB HBM3e):

Llama 4 Maverick at INT4 is around 200 GB total weight footprint. A single B200 (192 GB HBM3e) cannot fit it on its own. You need 2x B200 in tensor parallel or a single B300 (288 GB) for comfortable headroom. The B200 also enables native FP4 (NVFP4) inference via its Blackwell tensor cores, cutting 70B model weights to ~35 GB and enabling massive batching headroom. B200 on Spheron is currently spot-only at around $5.37/hr; spot capacity can be reclaimed without notice, so factor that into workload planning.

Multi-GPU and massive MoE models:

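DeepSeek V4-Flash at FP8 needs approximately 284 GB for weights. At INT4 (~142 GB), 2x H100 in tensor parallel (160 GB combined) is the minimum, and that leaves under 18 GB for KV cache and framework overhead. The full deployment walkthrough is at Deploy DeepSeek V4 on GPU Cloud.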
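
A rough way to turn a weight footprint into a GPU count is to divide by usable VRAM per card and round up. The sketch below is only a floor on card count: it ignores tensor-parallel communication buffers and assumes weights shard evenly. The function name and the `reserve_fraction` parameter are illustrative, not part of any framework.

```python
import math


def min_gpus(weights_gb: float, gpu_vram_gb: float,
             reserve_fraction: float = 0.0) -> int:
    """Smallest number of identical GPUs whose usable VRAM covers the weights.

    reserve_fraction is the share of each card held back for KV cache
    and overhead (0.2 reserves 20% of every GPU).
    """
    usable_gb = gpu_vram_gb * (1 - reserve_fraction)
    return math.ceil(weights_gb / usable_gb)


print(min_gpus(142, 80))        # 2 -> INT4 weights on H100 80 GB, nothing reserved
print(min_gpus(284, 80))        # 4 -> FP8 weights on H100 80 GB, nothing reserved
print(min_gpus(142, 80, 0.2))   # 3 -> INT4 with 20% of each card reserved
```

The third call shows why a weights-only minimum is tight in practice: once a fifth of each card is reserved for KV cache, the same INT4 model needs a third H100.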

GLM-5.2 at 744B MoE parameters is larger still. Expert parallelism across 8x H200 minimum is the practical floor for production serving. See Deploy GLM-5.2 on GPU Cloud for the routing and parallelism config.

Recommended GPUs and pricing:

| GPU | VRAM | On-Demand $/hr | Spot $/hr |
|---|---|---|---|
| H100 SXM5 | 80 GB HBM3 | $4.06 | $2.91 |
| H200 SXM5 | 141 GB HBM3e | $4.84 | $3.36 |
| B200 SXM6 | 192 GB HBM3e | n/a | $5.37 |
| B300 SXM6 | 288 GB HBM3e | ~$9.16 (est.) | n/a |

For guaranteed capacity, H100 SXM5 instances at $4.06/hr or H200 SXM5 instances at $4.84/hr are solid options. B200 SXM6 is currently spot-only at around $5.37/hr (spot can be reclaimed without notice).



Quick-Reference Model-to-GPU Matrix

| Model | Params | Format | VRAM Est. | Min Spheron GPU | $/hr | Best Use |
|---|---|---|---|---|---|---|
| LFM2-2.6B | 2.6B | BF16 | ~5-6 GB | RTX 4090 | $0.77 | Minimal footprint |
| Phi-4-mini 3.8B | 3.8B | FP16 | ~8 GB | RTX 4090 | $0.77 | Fast edge inference |
| Ministral 3 8B | 8B | AWQ INT4 | ~5-6 GB | RTX 4090 | $0.77 | Vision + instruction |
| GLM-4-9B | 9B | FP8 | ~9 GB | RTX 4090 | $0.77 | Compact coding |
| Gemma 4 12B | 12B | FP8 | ~12 GB | RTX 4090 | $0.77 | Multimodal |
| Phi-4 14B | 14B | AWQ INT4 | ~9 GB | RTX 4090 | $0.77 | Reasoning above 8B quality |
| Qwen 3 8B | 8B | FP8 | ~8 GB | RTX 4090 | $0.77 | Coding, general |
| Qwen 3 8B | 8B | FP16 | ~16 GB | RTX 4090 | $0.77 | Higher fidelity |
| Qwen 3 32B | 32B | AWQ INT4 | ~16 GB | RTX 4090 | $0.77 | Best coding/reasoning at tier |
| Gemma 4 31B | 31B | INT4 | ~16-18 GB | RTX 4090 | $0.77 | Multimodal mid-tier |
| Qwen 3.5 9B | 9B | FP16 | ~18 GB | RTX 4090 | $0.77 | Math, multilingual |
| LFM2-8B-A1B | 8B total | BF16 | ~18-19 GB | RTX 4090 | $0.77 | Long context SSM |
| Phi-4 14B | 14B | BF16 | ~28 GB | RTX 5090 | n/a | Long-context reasoning |
| Qwen 3.5 27B | 27B | FP8 | ~27 GB | RTX 5090 (32 GB) | n/a | Quality mid-size |
| Qwen2.5 72B | 72B | AWQ INT4 | ~36-40 GB | L40S | $0.96 | Production 70B |
| Llama 3.3 70B | 70B | AWQ INT4 | ~35-40 GB | L40S | $0.96 | Multilingual, RAG |
| Qwen 3.5 35B-A3B | 35B MoE | FP8 | ~35 GB | L40S | $0.96 | Long context MoE |
| Qwen2.5 72B | 72B | FP8 | ~72 GB | H100 SXM5 | $4.06 | Full-quality 80GB |
| Llama 4 Scout | 109B MoE | INT4 | ~55 GB | H100 SXM5 | $4.06 | Large MoE single-GPU |
| Qwen2.5 72B | 72B | FP16 | ~144 GB | 2x H100 | $8.12 | Max fidelity, research |



Cost Per Million Tokens by Tier

The formula is: CPM ($) = (GPU $/hr) / (tokens_per_sec × 3600 / 1,000,000)

Or equivalently: CPM = GPU_cost_per_hour / (throughput_tps × 0.0036)
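
As a quick sanity check, the formula can be written as a one-line helper. The function name is illustrative, and the inputs are hourly rates and batch-1 throughput estimates quoted in this guide, so the outputs are only as good as those estimates.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            throughput_tps: float) -> float:
    """CPM in dollars: hourly GPU cost divided by millions of tokens per hour."""
    million_tokens_per_hour = throughput_tps * 3600 / 1_000_000
    return gpu_cost_per_hour / million_tokens_per_hour


# RTX 4090 at $0.77/hr with an estimated 120 tokens/sec
print(round(cost_per_million_tokens(0.77, 120), 2))  # 1.78

# L40S at $0.96/hr with an estimated 30 tokens/sec
print(round(cost_per_million_tokens(0.96, 30), 2))   # 8.89
```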

Rough throughput estimates for decode-phase generation at batch 1:

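| GPU | Model | Format | Throughput (est. tps) | $/hr | CPM ($) |
|---|---|---|---|---|---|
| RTX 4090 | Qwen 3 8B | FP8 | ~120 | $0.77 | ~$1.78 |
| RTX 4090 | Phi-4 14B | AWQ INT4 | ~70 | $0.77 | ~$3.06 |
| L40S | Qwen2.5 72B | AWQ INT4 | ~30 | $0.96 | ~$8.89 |
| A100 80G SXM4 | Qwen2.5 72B | AWQ INT4 | ~45 | $1.69 | ~$10.43 |
| H100 SXM5 | Qwen2.5 72B | FP8 | ~90 | $4.06 | ~$12.53 |
| H200 SXM5 | Qwen2.5 72B | FP8 | ~130 | $4.84 | ~$10.34 |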

A few things stand out in this table. The L40S at $0.96/hr serving Qwen2.5 72B INT4 produces the lowest CPM among all 70B+ options. INT4 quantization cuts the memory bandwidth pressure enough to partially compensate for GDDR6 bandwidth limits. The H200 closes the gap with higher throughput at FP8, while the H100 carries the highest CPM at batch 1 despite its bandwidth advantage.

At batch sizes above 4-8, the picture changes. For batch inference and async queues, larger GPUs with HBM often win on throughput per dollar even at higher hourly rates. The AI inference cost economics 2026 post covers batch throughput scaling in more detail.

These are estimates at batch 1 decode under typical conditions. Real throughput varies with prefill ratio, sequence length distribution, and serving framework configuration. Benchmark your actual workload before committing to a tier.


When to Rent Bigger vs Quantize Smaller

A few concrete decision rules:

Quantize down if your eval scores pass at INT4. Many chatbot and coding tasks show no measurable regression from FP16 to INT4. Run your benchmark suite at INT4 on the target model before assuming you need FP8 or BF16.

Go bigger if you need 64K+ context at batch 4 or higher. KV cache at long context with real concurrency will eat your VRAM budget faster than model weights. An L40S running Qwen2.5 72B INT4 at 128K context is more constrained than an H100 running the same model.

Consider spot on a bigger tier vs on-demand on a smaller one. H100 spot at $2.91/hr is cheaper than H100 on-demand at $4.06/hr. For batch workloads and async queues, H100 spot often delivers better throughput-per-dollar than L40S on-demand, because you can serve FP8 with full KV cache headroom instead of INT4 with tight margins.

Go up one VRAM tier if your primary model is BF16 or FP8 and you care about generation quality on structured output, math reasoning, or tasks where INT4 regressions are known. The quality difference between FP8 and INT4 is more noticeable on chain-of-thought and tool-use tasks than on simple chat.

Use fractional instances for small models. If you are serving a 3B or 8B model, a full L40S is overkill. An MPS-partitioned fractional instance shares GPU resources at lower cost. The fractional GPU inference guide covers how to right-size for small models without paying for idle VRAM.


Every tier in this guide maps to a live Spheron GPU you can spin up today. Per-minute billing, no commitment, and 5+ providers aggregated for availability across every GPU model listed here.



Quick Setup Guide

  1. Calculate base VRAM requirement for your model

    Multiply parameter count by bytes per parameter: FP16/BF16 = 2 bytes, FP8/INT8 = 1 byte, INT4/AWQ = 0.5 bytes. For a 14B model at INT4: 14 × 0.5 = 7GB for weights. For MoE models, always use total parameter count not active parameter count.

  2. Add KV cache and overhead budget

    Budget 1GB for framework overhead (vLLM, SGLang, etc.). KV cache scales with context length and batch size. At 4K context and batch 1, add 2-4GB. At 32K context and batch 4, add 20-40GB depending on model architecture. Pick a GPU with at least 20% headroom above your total estimate.

  3. Choose quantization to fit a smaller tier

    If your model fits at FP8 but not FP16, start there. FP8 typically shows no quality regression on most tasks. If FP8 still doesn't fit, try AWQ INT4 (preferred over GPTQ for accuracy) or GGUF Q4_K_M for llama.cpp. Always run your eval benchmark at the quantization level before deploying to production.

  4. Provision and benchmark on Spheron

    Pick the smallest GPU tier where your model fits with KV cache headroom. Deploy via Spheron (app.spheron.ai), run your inference server (vLLM, SGLang, or Ollama), and benchmark tokens-per-second at your expected batch size and context length. Adjust quantization or GPU tier based on latency and throughput results.
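
Steps 1-4 can be strung together into a rough tier picker. This is a sketch under stated assumptions: the GPU list uses the VRAM capacities quoted in this guide, the 20% headroom is applied as a multiplier on the total estimate (one reading of step 2), and the function and variable names are hypothetical. It does not replace benchmarking on the actual card.

```python
# (name, VRAM in GB), sorted smallest first; capacities as quoted in this guide.
GPU_TIERS = [
    ("RTX 4090", 24),
    ("RTX 5090", 32),
    ("L40S", 48),
    ("H100 SXM5", 80),
    ("RTX PRO 6000", 96),
    ("H200 SXM5", 141),
    ("B200 SXM6", 192),
]


def smallest_gpu(params_billions: float, bytes_per_param: float,
                 kv_cache_gb: float, overhead_gb: float = 1.0,
                 headroom: float = 0.20):
    """Return (gpu_name, required_gb) for the smallest single GPU that fits.

    gpu_name is None when no single GPU in GPU_TIERS is large enough.
    For MoE models, pass the total parameter count, not the active count.
    """
    estimate_gb = params_billions * bytes_per_param + kv_cache_gb + overhead_gb
    required_gb = estimate_gb * (1 + headroom)
    for name, vram_gb in GPU_TIERS:
        if vram_gb >= required_gb:
            return name, required_gb
    return None, required_gb


# 14B model at INT4 with a 2 GB KV cache budget
print(smallest_gpu(14, 0.5, kv_cache_gb=2)[0])   # RTX 4090

# 72B model at INT4 with a 2 GB KV cache budget
print(smallest_gpu(72, 0.5, kv_cache_gb=2)[0])   # L40S
```

Raising `kv_cache_gb` to reflect longer context or larger batches is usually what pushes a model into the next tier, so it is worth re-running the estimate with your real serving parameters.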


Frequently Asked Questions

Which open-source LLMs fit in 12-16GB of VRAM?

Several strong options fit in 12-16GB. Qwen 3 8B at FP8 needs ~8GB for weights. Phi-4 14B at AWQ INT4 lands around 9GB. Ministral 3 8B at AWQ INT4 uses 5-6GB. GLM-4-9B at FP8 runs in ~9GB. The RTX 4090 (24GB) is the most common cloud GPU for this tier, giving you headroom for KV cache on top of the model weights.

Can I run a 70B model on a single 48GB GPU?

Yes, at INT4 quantization. Qwen2.5 72B at AWQ INT4 needs approximately 36-40GB for weights. On an L40S (48GB), that leaves 8-12GB for KV cache, which supports roughly 32K-48K context at batch 2. Llama 3.3 70B INT4 follows the same math. For 70B at FP8 (~72GB), you need a single H100 (80GB) or larger.

What is the best open-source LLM for a single H100 (80GB)?

It depends on what you are optimizing for. For quality at moderate context: Qwen2.5 72B at FP8 (~72GB weights) with limited KV cache headroom, or at AWQ INT4 (~36-40GB) with 40GB of KV cache room for long context serving. For something larger: Llama 4 Scout at INT4 fits in ~55GB on a single H100. For frontier quality: Qwen2.5 72B FP8 is close to FP16 on most benchmarks and makes efficient use of the 80GB HBM.

How does quantization reduce VRAM requirements?

Quantization reduces the bytes stored per parameter. FP16 and BF16 use 2 bytes per param. FP8 and INT8 use 1 byte (50% reduction). INT4 and AWQ INT4 use 0.5 bytes (75% reduction). So a 70B model goes from 140GB at FP16 to 70GB at FP8 to 35GB at INT4. KV cache size is not directly affected by weight quantization, though FP8 KV cache quantization can halve it separately.

What is the cheapest GPU for self-hosting an LLM?

The RTX 4090 (24GB) at $0.77/hr on Spheron is the entry point. It runs Qwen 3 8B at FP8 (~8GB), Phi-4 14B at INT4 (~9GB), and Qwen 3 32B at INT4 (~16GB) comfortably. For 70B-class models, the L40S (48GB) at $0.96/hr is the cheapest option, running Qwen2.5 72B or Llama 3.3 70B at INT4.
