Note: DeepSeek V4 has not officially launched as of March 2026. Multiple sources indicate an April 2026 release. The specifications, benchmarks, and model weights referenced in this guide are based on pre-release information and leaks - treat all details as provisional until the official release.
This is a preparation guide for deploying DeepSeek V4 once it launches: an expected 1-trillion-parameter Mixture-of-Experts model with an estimated ~37B active parameters per token (some sources report ~32B). It is expected to set new benchmarks on coding tasks and multi-step agentic workflows, and the weights are expected to be open. This guide covers the anticipated deployment path: hardware selection, vLLM configuration with expert parallelism, and tuning for production throughput.
What Is DeepSeek V4
DeepSeek V4 is a sparse Mixture-of-Experts (MoE) model. Pre-release reports put the total parameter count at 1 trillion, with only ~37B activating on any given forward pass. The model uses a top-K routing mechanism to select which experts process each token, which keeps per-token inference compute roughly constant regardless of total model size.
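Top-K routing is simple to sketch. The toy below is illustrative only: real MoE routers (including DeepSeek's) add load-balancing terms and shared experts, and the 16-expert gate with K=4 here is arbitrary. Each token's gate scores are softmaxed, the K highest-scoring experts are kept, and their weights are renormalized.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token and return
    (expert_index, normalized_weight) pairs."""
    # Softmax over all expert logits to get routing probabilities.
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts, then renormalize their weights so
    # the weighted combination of expert outputs sums with weight 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's gate scores over 16 toy experts; only 4 are selected,
# so only those 4 experts' weights participate in the forward pass.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5,
          0.2, 1.0, -2.0, 0.8, 2.5, 0.3, -1.5, 0.6]
routes = top_k_route(logits, k=4)
print([i for i, _ in routes])  # indices of the 4 chosen experts
```

Whatever the total expert count, only the selected experts' weights are multiplied per token, which is why compute tracks active parameters while VRAM tracks total parameters.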
Key benchmark highlights (based on leaked/claimed pre-release results - not independently verified):
- Claimed top scores on HumanEval and LiveCodeBench, ahead of GPT-4o on competitive programming tasks
- Strong performance on GAIA and SWE-bench multi-step agent benchmarks, with meaningful claimed gains over V3
- Note: AIME 2025 and IMO-level math results widely cited in coverage of DeepSeek models were achieved by DeepSeek R1 and Math-V2, not V4; V4 math results are unconfirmed
Compared to DeepSeek V3, V4 scales up from 671B to 1T total parameters while keeping a similar active parameter design (37B per the initial leaks, though some sources report ~32B using a Top-16 routing strategy - treat this as uncertain until official release). That means you pay more in storage and weight-loading costs but not in per-token compute. The context window is reported to increase to 1M tokens via DeepSeek Sparse Attention, up from 128K in V3.
For background on why active parameter count doesn't reduce memory requirements, see our GPU memory requirements guide for LLMs.
GPU Hardware Requirements for DeepSeek V4
The full model weights at FP8 precision require approximately 500GB of VRAM. BF16 doubles that to ~1TB. You need to account for the weights, KV cache, and activation memory when picking your cluster size.
| Configuration | VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1128GB | BF16 | 8 | Full precision; requires 1TB+ VRAM for ~1TB BF16 weights |
| 8x H100 SXM5 | 640GB | FP8 | 8 | Recommended default; fits ~500GB FP8 weights with KV cache headroom |
| 4x H200 SXM5 | 564GB | FP8 | 4 | Fewer GPUs; H200's 141GB per GPU fits FP8 weights |
| 4x H100 SXM5 | 320GB | INT4 | 4 | Budget option, some quality loss |
| 2x H200 SXM5 | 282GB | INT4 | 2 | Minimum viable for experimentation |
VRAM math: DeepSeek V4's FP8 weights total approximately 500GB. The 1T parameter count includes shared attention projections and embeddings that compress well, so the actual stored weight size in FP8 is ~500GB, not a full 1TB. BF16 doubles that to ~1TB. With 8x H100 (640GB total) in FP8, you fit ~500GB of weights and keep ~140GB for KV cache. With INT4 (~250GB for expert weights), 4x H100 (320GB) is tight but feasible for short context lengths.
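The headroom arithmetic above reduces to a one-line subtraction. A quick sketch using the provisional weight figures quoted in this section:

```python
def kv_headroom_gb(num_gpus, vram_per_gpu_gb, weight_gb):
    """Raw VRAM left for KV cache and activations after loading weights.
    Weight sizes are the provisional pre-release figures quoted above."""
    return num_gpus * vram_per_gpu_gb - weight_gb

# 8x H100 (80GB each) with ~500GB FP8 weights -> ~140GB of headroom
print(kv_headroom_gb(8, 80, 500))
# 4x H100 with ~250GB INT4 expert weights -> ~70GB, tight for long contexts
print(kv_headroom_gb(4, 80, 250))
```

In practice vLLM caps usable memory via `--gpu-memory-utilization`, so the effective KV budget is somewhat lower than this raw figure.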
The active parameter count (37B) does not reduce VRAM. All 1T parameter weights must be loaded into GPU memory because the router selects different experts per token. You can't lazy-load experts without major latency spikes.
For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For a quick lookup of memory requirements across different models, see our GPU requirements cheat sheet 2026.
Step-by-Step Deployment with vLLM
Prerequisites
- vLLM v0.14+ (current release at time of writing; expert parallelism support is available in recent versions)
- CUDA 12.4+, Python 3.10+
- Hugging Face account with access to the model weights (once V4 is released)
- 8x H100 SXM5 or 4x H200 SXM5 cluster
Install vLLM
```bash
pip install vllm  # installs latest; v0.14+ recommended for MoE expert parallelism support
```
Download Model Weights
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID based on pre-release information.
# The actual HuggingFace repo may use a different name upon release.
huggingface-cli download deepseek-ai/DeepSeek-V4 --local-dir ./deepseek-v4
```
The full FP8 weights are approximately 500GB. Use a persistent storage volume so you don't re-download on instance restarts.
Launch with 8x H100 (Tensor Parallelism)
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port 8000
```
(Note: vLLM's `--dtype` flag does not accept `fp8`; FP8 weights are served via `--quantization fp8`, or detected automatically if the checkpoint ships pre-quantized.) This splits every transformer layer across all 8 GPUs. All 8 GPUs participate in every forward pass, which minimizes time-to-first-token at the cost of higher NVLink bandwidth usage.
Launch with Mixed Expert and Tensor Parallelism
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
```
`--tensor-parallel-size` splits attention layers across GPUs. `--enable-expert-parallel` activates expert parallelism, splitting the MoE expert layers across GPUs so that each GPU holds a subset of experts and the router dispatches tokens to the GPU holding the relevant expert. The EP size is auto-calculated from the TP and DP sizes when this flag is enabled.
When to use each:
- Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs are always active
- Mixed TP+EP: better for throughput-maximizing workloads where you can tolerate higher TTFT; reduces communication overhead on attention layers by using fewer GPUs per layer
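As a mental model for expert parallelism, you can picture the experts partitioned into contiguous slices, one slice per EP rank. This layout is purely illustrative (vLLM's actual placement strategy and V4's expert count are assumptions here, not published facts):

```python
def expert_to_rank(expert_id, num_experts, ep_size):
    """Map an expert to the GPU rank that owns it, assuming experts
    are split into equal contiguous slices across EP ranks."""
    per_rank = num_experts // ep_size
    return expert_id // per_rank

# Hypothetical 256 experts over EP=8: 32 experts live on each GPU,
# and a routed token is dispatched to whichever rank holds its expert.
print(expert_to_rank(0, 256, 8), expert_to_rank(40, 256, 8), expert_to_rank(255, 256, 8))
```

The dispatch step is where EP's communication cost lives: tokens must travel to the rank holding their expert and the results travel back, which is why EP favors throughput over per-request latency.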
Test the Endpoint
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4",  # provisional model ID; verify upon official release
    messages=[{"role": "user", "content": "Write a Python function to binary search a sorted list."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
For full vLLM setup details including Docker deployment, monitoring, and load balancing, see the vLLM production deployment guide. For setting up an OpenAI-compatible API layer for self-hosted models, see the self-hosted OpenAI-compatible API guide.
Performance Benchmarks: H100 vs H200 vs A100
These are projected figures based on architectural extrapolation from DeepSeek V3 benchmarks and published vLLM MoE performance data. Actual results will vary based on hardware configuration, batch size, and request length.
| Hardware | Config | Throughput (tok/s) | p50 Latency (512 output tokens) | Notes |
|---|---|---|---|---|
| H100 SXM5 | 8x FP8 | ~1,800 | ~18s | NVLink-connected, TP=8 |
| H200 SXM5 | 4x FP8 | ~1,400 | ~22s | Fewer GPUs, higher VRAM per GPU |
| A100 SXM4 80GB | 8x INT4 | ~600 | ~52s | No hardware FP8; BF16 requires ~1TB VRAM (640GB total here is insufficient for BF16) |
H100 SXM5 in 8-GPU NVLink configuration gives the best throughput. H200 in 4-GPU configuration is competitive for shorter contexts where the larger VRAM (141GB vs 80GB) reduces KV cache pressure. A100 trails significantly because it lacks the Hopper Transformer Engine required for hardware-accelerated FP8, and 8x A100 (640GB total) cannot fit the ~1TB BF16 weights - INT4 quantization is required on this configuration.
For a detailed comparison of vLLM, TensorRT-LLM, and SGLang throughput numbers, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Spheron GPU Pricing for DeepSeek V4
Prices from live Spheron GPU pricing API, fetched 02 Apr 2026:
| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $14.76 | $5.72 | $10,627 | $4,118 |
| 8x H200 SXM5 | $29.52 | $11.44 | $21,254 | $8,237 |
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Spot instances are viable for batch inference jobs: document processing pipelines, offline evaluation, or any workload where you can handle interruptions. For real-time serving with SLA requirements, on-demand is the right choice.
Compared to proprietary API pricing for frontier models ($3-15 per million tokens), self-hosting DeepSeek V4 on 8x H100 spot ($4,608/month from the table above) breaks even at roughly 300 million to 1.5 billion tokens per month, depending on where your blended input/output rate falls in that range.
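The break-even arithmetic is worth making explicit, using the monthly spot figure from the pricing table above and the quoted $3-$15 per-million-token API range:

```python
def breakeven_mtok(monthly_cluster_usd, api_usd_per_mtok):
    """Monthly token volume, in millions of tokens, above which a
    flat-rate cluster undercuts paying per token."""
    return monthly_cluster_usd / api_usd_per_mtok

# 8x H100 spot at $4,608/month:
print(round(breakeven_mtok(4608, 15.0)))  # vs a $15/Mtok API
print(round(breakeven_mtok(4608, 3.0)))   # vs a $3/Mtok API
```

Below the break-even volume the API is cheaper; above it, every additional token on the cluster is effectively free until you saturate throughput.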
For a full competitive pricing comparison across GPU cloud providers, see GPU cloud pricing comparison 2026.
Production Optimization
FP8 Quantization
FP8 is the right default for DeepSeek V4 on H100. Benefits over BF16:
- ~2x memory reduction (500GB vs 1TB for weights)
- 1.3-1.5x throughput improvement from Transformer Engine FP8 Tensor Cores
- Minimal quality loss on standard benchmarks (typically under 1%)
Quantization decision guide:
- BF16: use when accuracy is critical and you have 8x H200 (1.1TB+ VRAM)
- FP8 (default): best tradeoff for production on 8x H100 or 4x H200
- INT4: use only if you need to run on fewer GPUs and can accept quality degradation on precision-sensitive tasks
KV Cache Tuning
FP8 KV cache cuts memory usage by half compared to BF16 KV cache:
```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```
`--max-model-len` vs `--gpu-memory-utilization` tradeoff:
| max-model-len | VRAM for KV cache | Max concurrent seqs (at max context) |
|---|---|---|
| 65536 | High | Low |
| 32768 | Medium | Medium |
| 16384 | Low | High |
For most applications, 32768 tokens is a practical ceiling. Very few real requests use the full context window, and reserving VRAM for a maximum context length you never use leaves fewer slots for concurrent users.
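The table's tradeoff can be estimated directly. V4's per-token KV footprint is unknown (DeepSeek's MLA-style attention compresses KV substantially), so the 70 KB/token figure below is a placeholder chosen to illustrate the scaling, not a measured value:

```python
def max_full_context_seqs(kv_budget_gb, max_model_len, kv_bytes_per_token):
    """How many sequences at full context length fit in a KV cache budget."""
    per_seq_gb = max_model_len * kv_bytes_per_token / 1e9
    return int(kv_budget_gb // per_seq_gb)

# ~140GB KV budget (the 8x H100 FP8 headroom discussed earlier),
# hypothetical 70KB/token: halving max-model-len doubles the slots.
for ctx in (16384, 32768, 65536):
    print(ctx, max_full_context_seqs(140, ctx, 70_000))
```

With FP8 KV cache, halve `kv_bytes_per_token` and the slot counts double, which is the whole argument for `--kv-cache-dtype fp8_e5m2`.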
For a deeper dive on PagedAttention, prefix caching, and CPU offloading, see the KV cache optimization guide.
Continuous Batching and Throughput
Enable chunked prefill for long-context requests:
```bash
--enable-chunked-prefill
```
Tune maximum concurrent sequences for your latency target:
```bash
--max-num-seqs 32  # lower = lower latency, higher = higher throughput
```
Monitor GPU utilization during load testing:
```bash
nvidia-smi dmon -s u -d 5
```
Watch for GPU utilization below 70% on loaded instances: that usually means `--max-num-seqs` is too low and the GPU is waiting for more concurrent requests.
For additional throughput techniques including speculative decoding, see the speculative decoding production guide.
DeepSeek V4 vs V3 vs Llama 4 Maverick
| Model | Total Params | Active Params | Context | Best For |
|---|---|---|---|---|
| DeepSeek V4 (pre-release) | 1T | ~37B* | 1M** | Coding, agentic tasks, reasoning |
| DeepSeek V3 | 671B | 37B | 128K | General purpose, production-stable |
| Llama 4 Maverick | 400B | 17B | 1M | Long-context, open weights |
*Active parameter count is reported as 37B by some leaks and ~32B by others depending on routing strategy; treat as uncertain until official release.
**V4 context window target is 1M tokens via DeepSeek Sparse Attention (pre-release claim).
Decision guidance:
Choose DeepSeek V4 when coding quality and agentic reasoning are the primary requirements. Pre-release claims suggest the 1T scale brings clear gains on complex multi-step tasks and competitive programming problems.
Choose DeepSeek V3 if you want a more established model with a larger community of deployment guides, known quirks, and benchmarked configurations. V3 is production-stable; V4 is newer.
Choose Llama 4 Maverick when you need context windows beyond 128K today, prefer Meta's open-weights license, or want lower active parameter counts (17B vs ~37B) to reduce serving cost.
For Llama 4 deployment steps, see Deploy Llama 4 on GPU cloud. For GPU hardware selection guidance, see Best GPU for AI inference 2026.
DeepSeek V4 is one of the more demanding open-weight models to self-host, but multi-GPU clusters on Spheron make it accessible without the overhead of managing physical hardware. Spot instances cut inference costs significantly for batch workloads.
Rent H100 SXM5 → | Rent H200 SXM5 → | View all GPU pricing →
