Note: DeepSeek V4 has not officially launched as of March 2026. Multiple sources indicate an April 2026 release. The specifications, benchmarks, and model weights referenced in this guide are based on pre-release information and leaks - treat all details as provisional until the official release.
This is a preparation guide for deploying DeepSeek V4 once it launches: an expected 1-trillion-parameter Mixture-of-Experts model with an estimated ~37B active parameters per token (some sources report ~32B). It is expected to set new benchmarks on coding tasks and multi-step agentic workflows, and the weights are expected to be open. This guide covers the anticipated deployment path: hardware selection, vLLM configuration with expert parallelism, and tuning for production throughput.
What Is DeepSeek V4
DeepSeek V4 is a sparse Mixture-of-Experts (MoE) model. Pre-release reports put the total parameter count at 1 trillion, with only ~37B activating on any given forward pass. The model uses a top-K routing mechanism to select which experts process each token, which keeps per-token inference compute roughly constant regardless of total model size.
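Top-K routing is simple to sketch. The toy below is illustrative only: real MoE routers (including DeepSeek's) add load-balancing terms and shared experts, and the 16-expert gate with K=4 here is arbitrary. Each token's gate scores are softmaxed, the K highest-scoring experts are kept, and their weights are renormalized.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token and return
    (expert_index, normalized_weight) pairs."""
    # Softmax over all expert logits to get routing probabilities.
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts, then renormalize their weights so
    # the weighted combination of expert outputs sums with weight 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's gate scores over 16 toy experts; only 4 are selected,
# so only those 4 experts' weights participate in the forward pass.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5,
          0.2, 1.0, -2.0, 0.8, 2.5, 0.3, -1.5, 0.6]
routes = top_k_route(logits, k=4)
print([i for i, _ in routes])  # indices of the 4 chosen experts
```

Whatever the total expert count, only the selected experts' weights are multiplied per token, which is why compute tracks active parameters while VRAM tracks total parameters.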
Key benchmark highlights (based on leaked/claimed pre-release results - not independently verified):
- Claimed top scores on HumanEval and LiveCodeBench, ahead of GPT-4o on competitive programming tasks
- Strong performance on GAIA and SWE-bench multi-step agent benchmarks, with meaningful claimed gains over V3
- Note: AIME 2025 and IMO-level math results widely cited in coverage of DeepSeek models were achieved by DeepSeek R1 and Math-V2, not V4; V4 math results are unconfirmed
Compared to DeepSeek V3, V4 scales up from 671B to 1T total parameters while keeping a similar active parameter design (37B per the initial leaks, though some sources report ~32B using a Top-16 routing strategy - treat this as uncertain until official release). That means you pay more in storage and weight-loading costs but not in per-token compute. The context window is reported to increase to 1M tokens via DeepSeek Sparse Attention, up from 128K in V3.
For background on why active parameter count doesn't reduce memory requirements, see our GPU memory requirements guide for LLMs.
GPU Hardware Requirements for DeepSeek V4
The full model weights at FP8 precision require approximately 500GB of VRAM. BF16 doubles that to ~1TB. You need to account for the weights, KV cache, and activation memory when picking your cluster size.
| Configuration | VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1128GB | BF16 | 8 | Full precision; requires 1TB+ VRAM for ~1TB BF16 weights |
| 8x H100 SXM5 | 640GB | FP8 | 8 | Recommended default; fits ~500GB FP8 weights with KV cache headroom |
| 4x H200 SXM5 | 564GB | FP8 | 4 | Fewer GPUs; H200's 141GB per GPU fits FP8 weights |
| 4x H100 SXM5 | 320GB | INT4 | 4 | Budget option, some quality loss |
| 2x H200 SXM5 | 282GB | INT4 | 2 | Minimum viable for experimentation |
VRAM math: DeepSeek V4's FP8 weights total approximately 500GB. The 1T parameter count includes shared attention projections and embeddings that compress well, so the actual stored weight size in FP8 is ~500GB, not a full 1TB. BF16 doubles that to ~1TB. With 8x H100 (640GB total) in FP8, you fit ~500GB of weights and keep ~140GB for KV cache. With INT4 (~250GB for expert weights), 4x H100 (320GB) is tight but feasible for short context lengths.
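The headroom arithmetic above reduces to a one-line subtraction. A quick sketch using the provisional weight figures quoted in this section:

```python
def kv_headroom_gb(num_gpus, vram_per_gpu_gb, weight_gb):
    """Raw VRAM left for KV cache and activations after loading weights.
    Weight sizes are the provisional pre-release figures quoted above."""
    return num_gpus * vram_per_gpu_gb - weight_gb

# 8x H100 (80GB each) with ~500GB FP8 weights -> ~140GB of headroom
print(kv_headroom_gb(8, 80, 500))
# 4x H100 with ~250GB INT4 expert weights -> ~70GB, tight for long contexts
print(kv_headroom_gb(4, 80, 250))
```

In practice vLLM caps usable memory via `--gpu-memory-utilization`, so the effective KV budget is somewhat lower than this raw figure.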
The active parameter count (37B) does not reduce VRAM. All 1T parameter weights must be loaded into GPU memory because the router selects different experts per token. You can't lazy-load experts without major latency spikes.
For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For a quick lookup of memory requirements across different models, see our GPU requirements cheat sheet 2026.
Step-by-Step Deployment with vLLM
Prerequisites
- vLLM v0.14+ (current release at time of writing; expert parallelism support is available in recent versions)
- CUDA 12.4+, Python 3.10+
- Hugging Face account with access to the model weights (once V4 is released)
- 8x H100 SXM5 or 4x H200 SXM5 cluster
Install vLLM
```bash
pip install vllm  # installs latest; v0.14+ recommended for MoE expert parallelism support
```
Download Model Weights
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID based on pre-release information.
# The actual HuggingFace repo may use a different name upon release.
huggingface-cli download deepseek-ai/DeepSeek-V4 --local-dir ./deepseek-v4
```
The full FP8 weights are approximately 500GB. Use a persistent storage volume so you don't re-download on instance restarts.
Launch with 8x H100 (Tensor Parallelism)
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port 8000
```
(Note: vLLM's `--dtype` flag does not accept `fp8`; FP8 weights are served via `--quantization fp8`, or detected automatically if the checkpoint ships pre-quantized.) This splits every transformer layer across all 8 GPUs. All 8 GPUs participate in every forward pass, which minimizes time-to-first-token at the cost of higher NVLink bandwidth usage.
Launch with Mixed Expert and Tensor Parallelism
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
```
`--tensor-parallel-size` splits attention layers across GPUs. `--enable-expert-parallel` activates expert parallelism, splitting the MoE expert layers across GPUs so that each GPU holds a subset of experts and the router dispatches tokens to the GPU holding the relevant expert. The EP size is auto-calculated from the TP and DP sizes when this flag is enabled.
When to use each:
- Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs are always active
- Mixed TP+EP: better for throughput-maximizing workloads where you can tolerate higher TTFT; reduces communication overhead on attention layers by using fewer GPUs per layer
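As a mental model for expert parallelism, you can picture the experts partitioned into contiguous slices, one slice per EP rank. This layout is purely illustrative (vLLM's actual placement strategy and V4's expert count are assumptions here, not published facts):

```python
def expert_to_rank(expert_id, num_experts, ep_size):
    """Map an expert to the GPU rank that owns it, assuming experts
    are split into equal contiguous slices across EP ranks."""
    per_rank = num_experts // ep_size
    return expert_id // per_rank

# Hypothetical 256 experts over EP=8: 32 experts live on each GPU,
# and a routed token is dispatched to whichever rank holds its expert.
print(expert_to_rank(0, 256, 8), expert_to_rank(40, 256, 8), expert_to_rank(255, 256, 8))
```

The dispatch step is where EP's communication cost lives: tokens must travel to the rank holding their expert and the results travel back, which is why EP favors throughput over per-request latency.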
Test the Endpoint
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4",  # provisional model ID; verify upon official release
    messages=[{"role": "user", "content": "Write a Python function to binary search a sorted list."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
For full vLLM setup details including Docker deployment, monitoring, and load balancing, see the vLLM production deployment guide. For setting up an OpenAI-compatible API layer for self-hosted models, see the self-hosted OpenAI-compatible API guide.
Performance Benchmarks: H100 vs H200 vs A100
These are projected figures based on architectural extrapolation from DeepSeek V3 benchmarks and published vLLM MoE performance data. Actual results will vary based on hardware configuration, batch size, and request length.
| Hardware | Config | Throughput (tok/s) | p50 Latency (512 output tokens) | Notes |
|---|---|---|---|---|
| H100 SXM5 | 8x FP8 | ~1,800 | ~18s | NVLink-connected, TP=8 |
| H200 SXM5 | 4x FP8 | ~1,400 | ~22s | Fewer GPUs, higher VRAM per GPU |
| A100 SXM4 80GB | 8x INT4 | ~600 | ~52s | No hardware FP8; BF16 requires ~1TB VRAM (640GB total here is insufficient for BF16) |
H100 SXM5 in 8-GPU NVLink configuration gives the best throughput. H200 in 4-GPU configuration is competitive for shorter contexts where the larger VRAM (141GB vs 80GB) reduces KV cache pressure. A100 trails significantly because it lacks the Hopper Transformer Engine required for hardware-accelerated FP8, and 8x A100 (640GB total) cannot fit the ~1TB BF16 weights - INT4 quantization is required on this configuration.
For a detailed comparison of vLLM, TensorRT-LLM, and SGLang throughput numbers, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Spheron GPU Pricing for DeepSeek V4
Prices from live Spheron GPU pricing API, fetched 02 Apr 2026:
| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $14.76 | $5.72 | $10,627 | $4,118 |
| 8x H200 SXM5 | $29.52 | $11.44 | $21,254 | $8,237 |
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spot instances are viable for batch inference jobs: document processing pipelines, offline evaluation, or any workload where you can handle interruptions. For real-time serving with SLA requirements, on-demand is the right choice.
Compared to proprietary API pricing for frontier models ($3-15 per million tokens), self-hosting DeepSeek V4 on 8x H100 spot ($4,608/month from the table above) breaks even at roughly 300 million to 1.5 billion tokens per month, depending on where your blended input/output rate falls in that range.
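The break-even arithmetic is worth making explicit, using the monthly spot figure from the pricing table above and the quoted $3-$15 per-million-token API range:

```python
def breakeven_mtok(monthly_cluster_usd, api_usd_per_mtok):
    """Monthly token volume, in millions of tokens, above which a
    flat-rate cluster undercuts paying per token."""
    return monthly_cluster_usd / api_usd_per_mtok

# 8x H100 spot at $4,608/month:
print(round(breakeven_mtok(4608, 15.0)))  # vs a $15/Mtok API
print(round(breakeven_mtok(4608, 3.0)))   # vs a $3/Mtok API
```

Below the break-even volume the API is cheaper; above it, every additional token on the cluster is effectively free until you saturate throughput.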
For a full competitive pricing comparison across GPU cloud providers, see GPU cloud pricing comparison 2026.
Production Optimization
FP8 Quantization
FP8 is the right default for DeepSeek V4 on H100. Benefits over BF16:
- ~2x memory reduction (500GB vs 1TB for weights)
- 1.3-1.5x throughput improvement from Transformer Engine FP8 Tensor Cores
- Minimal quality loss on standard benchmarks (typically under 1%)
Quantization decision guide:
- BF16: use when accuracy is critical and you have 8x H200 (1.1TB+ VRAM)
- FP8 (default): best tradeoff for production on 8x H100 or 4x H200
- INT4: use only if you need to run on fewer GPUs and can accept quality degradation on precision-sensitive tasks
KV Cache Tuning
FP8 KV cache cuts memory usage by half compared to BF16 KV cache:
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```
`--max-model-len` vs `--gpu-memory-utilization` tradeoff:
| max-model-len | VRAM for KV cache | Max concurrent seqs (at max context) |
|---|---|---|
| 65536 | High | Low |
| 32768 | Medium | Medium |
| 16384 | Low | High |
For most applications, 32768 tokens is a practical ceiling. Very few real requests use the full context window, and reserving VRAM for a maximum context length you never use leaves fewer slots for concurrent users.
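The table's tradeoff can be estimated directly. V4's per-token KV footprint is unknown (DeepSeek's MLA-style attention compresses KV substantially), so the 70 KB/token figure below is a placeholder chosen to illustrate the scaling, not a measured value:

```python
def max_full_context_seqs(kv_budget_gb, max_model_len, kv_bytes_per_token):
    """How many sequences at full context length fit in a KV cache budget."""
    per_seq_gb = max_model_len * kv_bytes_per_token / 1e9
    return int(kv_budget_gb // per_seq_gb)

# ~140GB KV budget (the 8x H100 FP8 headroom discussed earlier),
# hypothetical 70KB/token: halving max-model-len doubles the slots.
for ctx in (16384, 32768, 65536):
    print(ctx, max_full_context_seqs(140, ctx, 70_000))
```

With FP8 KV cache, halve `kv_bytes_per_token` and the slot counts double, which is the whole argument for `--kv-cache-dtype fp8_e5m2`.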
For a deeper dive on PagedAttention, prefix caching, and CPU offloading, see the KV cache optimization guide.
Continuous Batching and Throughput
Enable chunked prefill for long-context requests:
```bash
--enable-chunked-prefill
```
Tune maximum concurrent sequences for your latency target:
```bash
--max-num-seqs 32  # lower = lower latency, higher = higher throughput
```
Monitor GPU utilization during load testing:
```bash
nvidia-smi dmon -s u -d 5
```
Watch for GPU utilization below 70% on loaded instances: that usually means `--max-num-seqs` is too low and the GPU is waiting for more concurrent requests.
For additional throughput techniques including speculative decoding, see the speculative decoding production guide.
DeepSeek V4 vs V3 vs Llama 4 Maverick
| Model | Total Params | Active Params | Context | Best For |
|---|---|---|---|---|
| DeepSeek V4 (pre-release) | 1T | ~37B* | 1M** | Coding, agentic tasks, reasoning |
| DeepSeek V3 | 671B | 37B | 128K | General purpose, production-stable |
| Llama 4 Maverick | 400B | 17B | 1M | Long-context, open weights |
*Active parameter count is reported as 37B by some leaks and ~32B by others depending on routing strategy; treat as uncertain until official release.
**V4 context window target is 1M tokens via DeepSeek Sparse Attention (pre-release claim).
Decision guidance:
Choose DeepSeek V4 when coding quality and agentic reasoning are the primary requirements. Pre-release claims suggest the 1T scale brings clear gains on complex multi-step tasks and competitive programming problems.
Choose DeepSeek V3 if you want a more established model with a larger community of deployment guides, known quirks, and benchmarked configurations. V3 is production-stable; V4 is newer.
Choose Llama 4 Maverick when you need context windows beyond 128K today, prefer Meta's open-weights license, or want lower active parameter counts (17B vs ~37B) to reduce serving cost.
For Llama 4 deployment steps, see Deploy Llama 4 on GPU cloud. For GPU hardware selection guidance, see Best GPU for AI inference 2026.
DeepSeek V4 is one of the more demanding open-weight models to self-host, but multi-GPU clusters on Spheron make it accessible without the overhead of managing physical hardware. Spot instances cut inference costs significantly for batch workloads.
Rent H100 SXM5 → | Rent H200 SXM5 → | View all GPU pricing →
