Tutorial

Deploy DeepSeek V4 on GPU Cloud: MoE Inference with vLLM and Expert Parallelism

Written by Mitrasish, Co-founder · Apr 2, 2026

Note: DeepSeek V4 has not officially launched as of March 2026. Multiple sources indicate an April 2026 release. The specifications, benchmarks, and model weights referenced in this guide are based on pre-release information and leaks - treat all details as provisional until the official release.

This is a preparation guide for deploying DeepSeek V4 once it launches: an expected 1-trillion-parameter Mixture-of-Experts model with an estimated ~37B active parameters per token (some sources report ~32B). It is expected to set new benchmarks on coding tasks and multi-step agentic workflows, and the weights are expected to be open. This guide covers the anticipated deployment path: hardware selection, vLLM configuration with expert parallelism, and tuning for production throughput.

What Is DeepSeek V4?

DeepSeek V4 is a sparse MoE model. The reported total parameter count is 1 trillion, but only ~37B parameters activate on any given forward pass. The model uses a top-K routing mechanism to select which experts process each token, which keeps inference compute roughly constant regardless of total model size.
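To make the routing idea concrete, here is a minimal, illustrative top-K router in Python. This is not DeepSeek's actual gating function (V4's router details are unpublished); the expert count and K value are placeholders.

```python
import math
import random

def top_k_route(logits, k):
    """Pick the k highest-scoring experts, then softmax-normalize their gate weights."""
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in indices]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(indices, exps)]

# Placeholder numbers: 256 routed experts, top-8 selected per token.
random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(256)]
selected = top_k_route(router_logits, k=8)
assert len(selected) == 8                                # only 8 of 256 experts fire
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9     # gate weights sum to 1
```

Whatever the total expert count, each token only touches K experts, which is why per-token FLOPs stay roughly flat as the model scales.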

Key benchmark highlights (based on leaked/claimed pre-release results - not independently verified):

  • Claimed top scores on HumanEval and LiveCodeBench, ahead of GPT-4o on competitive programming tasks
  • Strong performance on GAIA and SWE-bench multi-step agent benchmarks, with meaningful claimed gains over V3
  • Note: AIME 2025 and IMO-level math results widely cited in coverage of DeepSeek models were achieved by DeepSeek R1 and Math-V2, not V4; V4 math results are unconfirmed

Compared to DeepSeek V3, V4 scales up from 671B to 1T total parameters while keeping a similar active parameter design (37B per the initial leaks, though some sources report ~32B using a Top-16 routing strategy - treat this as uncertain until official release). That means you pay more in storage and weight-loading costs but not in per-token compute. The context window is reported to increase to 1M tokens via DeepSeek Sparse Attention, up from 128K in V3.

For background on why active parameter count doesn't reduce memory requirements, see our GPU memory requirements guide for LLMs.

GPU Hardware Requirements for DeepSeek V4

The full model weights at FP8 precision require approximately 500GB of VRAM. BF16 doubles that to ~1TB. You need to account for the weights, KV cache, and activation memory when picking your cluster size.

| Configuration | VRAM | Quantization | Min GPUs | Notes |
|---|---|---|---|---|
| 8x H200 SXM5 | 1128GB | BF16 | 8 | Full precision; requires 1TB+ VRAM for ~1TB BF16 weights |
| 8x H100 SXM5 | 640GB | FP8 | 8 | Recommended default; fits ~500GB FP8 weights with KV cache headroom |
| 4x H200 SXM5 | 564GB | FP8 | 4 | Fewer GPUs; H200's 141GB per GPU fits FP8 weights |
| 4x H100 SXM5 | 320GB | INT4 | 4 | Budget option, some quality loss |
| 2x H200 SXM5 | 282GB | INT4 | 2 | Minimum viable for experimentation |

VRAM math: DeepSeek V4's FP8 weights total approximately 500GB. The 1T parameter count includes shared attention projections and embeddings that compress well, so the actual stored weight size in FP8 is ~500GB, not a full 1TB. BF16 doubles that to ~1TB. With 8x H100 (640GB total) in FP8, you fit ~500GB of weights and keep ~140GB for KV cache. With INT4 (~250GB for expert weights), 4x H100 (320GB) is tight but feasible for short context lengths.

The active parameter count (37B) does not reduce VRAM. All 1T parameter weights must be loaded into GPU memory because the router selects different experts per token. You can't lazy-load experts without major latency spikes.
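As a sanity check, the sizing arithmetic above can be scripted. This is a rough estimator using the pre-release stored-weight figures (~500GB FP8, ~1TB BF16); all numbers are provisional until the official release.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(stored_params_b, dtype):
    """Weight memory in GB: stored params (billions) x bytes per param."""
    return stored_params_b * BYTES_PER_PARAM[dtype]

def fits(gpu_count, gb_per_gpu, stored_params_b, dtype, kv_headroom_gb=0):
    """True if the cluster holds the weights plus the requested KV-cache headroom."""
    return gpu_count * gb_per_gpu >= weight_vram_gb(stored_params_b, dtype) + kv_headroom_gb

# Pre-release figure: effective stored size equivalent to ~500B params at FP8 (~500GB).
print(weight_vram_gb(500, "fp8"))    # -> 500.0 GB
print(fits(8, 80, 500, "fp8", 100))  # 8x H100: 640GB >= 600GB -> True
print(fits(4, 80, 500, "fp8", 100))  # 4x H100 at FP8: 320GB < 600GB -> False
```

The last line is why 4x H100 only appears in the table above with INT4: at FP8 the weights alone exceed the cluster's total VRAM.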

For hardware comparisons between H100 and H200, see H100 vs H200: which GPU for LLM inference. For a quick lookup of memory requirements across different models, see our GPU requirements cheat sheet 2026.

Step-by-Step Deployment with vLLM

Prerequisites

  • vLLM v0.14+ (current release at time of writing; expert parallelism support is available in recent versions)
  • CUDA 12.4+, Python 3.10+
  • Hugging Face account with access to the model weights (once V4 is released)
  • 8x H100 SXM5 or 4x H200 SXM5 cluster

Install vLLM

```bash
pip install vllm  # installs latest; v0.14+ recommended for MoE expert parallelism support
```

Download Model Weights

```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID based on pre-release information.
# The actual HuggingFace repo may use a different name upon release.
huggingface-cli download deepseek-ai/DeepSeek-V4 --local-dir ./deepseek-v4
```

The full FP8 weights are approximately 500GB. Use a persistent storage volume so you don't re-download on instance restarts.

Launch with 8x H100 (Tensor Parallelism)

```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port 8000
```

This splits every transformer layer across all 8 GPUs simultaneously. All 8 GPUs participate in every forward pass, which minimizes time-to-first-token at the cost of higher NVLink bandwidth usage.

Launch with Mixed Expert and Tensor Parallelism

```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
```

--tensor-parallel-size splits attention layers across GPUs. --enable-expert-parallel activates expert parallelism, splitting the MoE expert layers across GPUs so each GPU holds a subset of experts and the router dispatches tokens to the GPU that holds the relevant expert. The EP size is auto-calculated from TP and DP sizes when this flag is enabled.

When to use each:

  • Pure tensor parallelism (TP=8): better for latency-sensitive serving; all GPUs are always active
  • Mixed TP+EP: better for throughput-maximizing workloads where you can tolerate higher TTFT; reduces communication overhead on attention layers by using fewer GPUs per layer
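The resulting expert placement can be sketched numerically. The snippet below assumes vLLM's documented behavior of deriving the EP degree from TP × DP; the 256-expert count is a placeholder, since V4's expert count is unconfirmed.

```python
def expert_parallel_layout(num_experts, tp_size, dp_size):
    """vLLM-style EP sizing: EP degree = TP x DP, experts split evenly across ranks."""
    ep_size = tp_size * dp_size
    if num_experts % ep_size:
        raise ValueError("expert count must divide evenly across EP ranks")
    per_rank = num_experts // ep_size
    # Map each EP rank to the contiguous slice of experts it hosts.
    return {rank: list(range(rank * per_rank, (rank + 1) * per_rank))
            for rank in range(ep_size)}

# Placeholder: 256 routed experts, TP=4, DP=1 (matching the launch command above).
layout = expert_parallel_layout(256, tp_size=4, dp_size=1)
print(len(layout), len(layout[0]))  # 4 EP ranks, 64 experts on each
```

Each GPU holds only its slice of the expert weights; the router then dispatches each token's hidden state to whichever ranks host its selected experts, which is the communication pattern the TP-vs-EP tradeoff above is really about.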

Test the Endpoint

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4",  # provisional model ID; verify upon official release
    messages=[{"role": "user", "content": "Write a Python function to binary search a sorted list."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

For full vLLM setup details including Docker deployment, monitoring, and load balancing, see the vLLM production deployment guide. For setting up an OpenAI-compatible API layer for self-hosted models, see the self-hosted OpenAI-compatible API guide.

Performance Benchmarks: H100 vs H200 vs A100

These are projected figures based on architectural extrapolation from DeepSeek V3 benchmarks and published vLLM MoE performance data. Actual results will vary based on hardware configuration, batch size, and request length.

| Hardware | Config | Throughput (tok/s) | p50 Latency (512 output tokens) | Notes |
|---|---|---|---|---|
| H100 SXM5 | 8x FP8 | ~1,800 | ~18s | NVLink-connected, TP=8 |
| H200 SXM5 | 4x FP8 | ~1,400 | ~22s | Fewer GPUs, higher VRAM per GPU |
| A100 SXM4 80GB | 8x INT4 | ~600 | ~52s | No hardware FP8; 640GB total is insufficient for ~1TB BF16 weights |

H100 SXM5 in 8-GPU NVLink configuration gives the best throughput. H200 in 4-GPU configuration is competitive for shorter contexts where the larger VRAM (141GB vs 80GB) reduces KV cache pressure. A100 trails significantly because it lacks the Hopper Transformer Engine required for hardware-accelerated FP8, and 8x A100 (640GB total) cannot fit the ~1TB BF16 weights - INT4 quantization is required on this configuration.

For a detailed comparison of vLLM, TensorRT-LLM, and SGLang throughput numbers, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Spheron GPU Pricing for DeepSeek V4

Prices from live Spheron GPU pricing API, fetched 02 Apr 2026:

| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|
| 4x H100 SXM5 | $9.60 | $3.20 | $6,912 | $2,304 |
| 8x H100 SXM5 | $19.20 | $6.40 | $13,824 | $4,608 |
| 4x H200 SXM5 | $14.76 | $5.72 | $10,627 | $4,118 |
| 8x H200 SXM5 | $29.52 | $11.44 | $21,254 | $8,237 |

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Spot instances are viable for batch inference jobs: document processing pipelines, offline evaluation, or any workload where you can handle interruptions. For real-time serving with SLA requirements, on-demand is the right choice.

Compared to proprietary API pricing for frontier models ($3-15 per million tokens), self-hosting DeepSeek V4 on 8x H100 spot ($6.40/hr, roughly $154/day) breaks even at roughly 10-50 million tokens per day, depending on your context length distribution and which API rate you are replacing.
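The break-even arithmetic, using the spot rate from the pricing table above ($6.40/hr for 8x H100) against the $3-15 per-million-token API range:

```python
def breakeven_tokens(gpu_cost, api_price_per_million_tokens):
    """Token volume at which self-hosting spend equals API spend."""
    return gpu_cost / api_price_per_million_tokens * 1_000_000

daily_cost = 6.40 * 24                       # 8x H100 spot, $/day
low = breakeven_tokens(daily_cost, 15.0)     # vs a $15/M-token API -> ~10M tokens/day
high = breakeven_tokens(daily_cost, 3.0)     # vs a $3/M-token API  -> ~51M tokens/day
print(f"break-even: {low/1e6:.0f}M to {high/1e6:.0f}M tokens/day")
```

Below that volume the API is cheaper; above it, self-hosting wins, and the gap widens further if you can keep the cluster saturated with batch work.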

For a full competitive pricing comparison across GPU cloud providers, see GPU cloud pricing comparison 2026.

Production Optimization

FP8 Quantization

FP8 is the right default for DeepSeek V4 on H100. Benefits over BF16:

  • ~2x memory reduction (500GB vs 1TB for weights)
  • 1.3-1.5x throughput improvement from Transformer Engine FP8 Tensor Cores
  • Minimal quality loss on standard benchmarks (typically under 1%)

Quantization decision guide:

  • BF16: use when accuracy is critical and you have 8x H200 (1.1TB+ VRAM)
  • FP8 (default): best tradeoff for production on 8x H100 or 4x H200
  • INT4: use only if you need to run on fewer GPUs and can accept quality degradation on precision-sensitive tasks

KV Cache Tuning

FP8 KV cache cuts memory usage by half compared to BF16 KV cache:

```bash
# Note: deepseek-ai/DeepSeek-V4 is a provisional model ID; verify upon official release.
vllm serve deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --dtype fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

--max-model-len vs --gpu-memory-utilization tradeoff:

| max-model-len | VRAM for KV cache | Max concurrent seqs (at max context) |
|---|---|---|
| 65536 | High | Low |
| 32768 | Medium | Medium |
| 16384 | Low | High |

For most applications, 32768 tokens is a practical ceiling. Very few real requests use the full context window, and reserving VRAM for a maximum context length you never serve leaves fewer slots for concurrent users.
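The tradeoff in the table can be quantified. The helper below uses placeholder architecture numbers (layer count, cached dims per token), since V4's exact KV footprint, especially with an MLA-style latent cache, is unconfirmed; the point is the inverse relationship, not the absolute values.

```python
def max_concurrent_seqs(kv_budget_gb, seq_len, num_layers, kv_dim_per_layer, bytes_per_elem=1):
    """How many full-length sequences fit in a fixed KV-cache budget."""
    per_seq_bytes = seq_len * num_layers * kv_dim_per_layer * bytes_per_elem
    return int(kv_budget_gb * 1e9 // per_seq_bytes)

# Placeholder shape: 61 layers, 1024 cached dims per layer per token, FP8 KV cache
# (1 byte/elem), and the ~140GB headroom from the 8x H100 sizing above.
for ctx in (65536, 32768, 16384):
    print(ctx, max_concurrent_seqs(140, ctx, 61, 1024))
```

Halving `--max-model-len` doubles the number of max-length sequences the same KV budget can hold, which is exactly the High/Medium/Low pattern in the table.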

For a deeper dive on PagedAttention, prefix caching, and CPU offloading, see the KV cache optimization guide.

Continuous Batching and Throughput

Enable chunked prefill for long-context requests:

```bash
--enable-chunked-prefill
```

Tune maximum concurrent sequences for your latency target:

```bash
--max-num-seqs 32   # lower = lower latency, higher = higher throughput
```

Monitor GPU utilization during load testing:

```bash
nvidia-smi dmon -s u -d 5
```

Watch for GPU utilization below 70% on loaded instances: that usually means --max-num-seqs is too low and the GPU is waiting for more concurrent requests.
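If you want to automate that check, the snippet below parses dmon's text output. It assumes the default `-s u` column order (gpu, sm, mem, enc, dec); verify against the header line your driver version prints.

```python
def mean_sm_utilization(dmon_output):
    """Average the 'sm' column from `nvidia-smi dmon -s u` text output.

    Assumes the default column order (gpu, sm, mem, enc, dec); header lines
    start with '#'. Returns None if no samples are present.
    """
    samples = []
    for line in dmon_output.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        try:
            samples.append(int(fields[1]))   # 'sm' is the second column
        except (IndexError, ValueError):
            continue  # skip malformed rows and '-' placeholders
    return sum(samples) / len(samples) if samples else None

sample = """# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    62    35     0     0
    1    58    33     0     0"""
avg = mean_sm_utilization(sample)
if avg is not None and avg < 70:
    print(f"avg SM util {avg:.0f}% < 70%: consider raising --max-num-seqs")
```

Pipe live dmon output into this during a load test to catch under-batching without staring at the terminal.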

For additional throughput techniques including speculative decoding, see the speculative decoding production guide.

DeepSeek V4 vs V3 vs Llama 4 Maverick

| Model | Total Params | Active Params | Context | Best For |
|---|---|---|---|---|
| DeepSeek V4 (pre-release) | 1T | ~37B* | 1M** | Coding, agentic tasks, reasoning |
| DeepSeek V3 | 671B | 37B | 128K | General purpose, production-stable |
| Llama 4 Maverick | 400B | 17B | 1M | Long-context, open weights |

*Active parameter count is reported as 37B by some leaks and ~32B by others depending on routing strategy; treat as uncertain until official release.

**V4 context window target is 1M tokens via DeepSeek Sparse Attention (pre-release claim).

Decision guidance:

Choose DeepSeek V4 when coding quality and agentic reasoning are the primary requirements. Pre-release claims suggest the 1T scale delivers clear gains on complex multi-step tasks and competitive programming problems.

Choose DeepSeek V3 if you want a more established model with a larger community of deployment guides, known quirks, and benchmarked configurations. V3 is production-stable; V4 is newer.

Choose Llama 4 Maverick when you need context windows beyond V3's 128K today, prefer Meta's open-weights license, or want lower active parameter counts (17B vs 37B) to reduce serving cost.

For Llama 4 deployment steps, see Deploy Llama 4 on GPU cloud. For GPU hardware selection guidance, see Best GPU for AI inference 2026.


DeepSeek V4 is one of the more demanding open-weight models to self-host, but multi-GPU clusters on Spheron make it accessible without the overhead of managing physical hardware. Spot instances cut inference costs significantly for batch workloads.

Rent H100 SXM5 → | Rent H200 SXM5 → | View all GPU pricing →

Deploy DeepSeek V4 on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models - ready when you are.