Deploy DeepSeek V4-Flash on GPU Cloud: 284B MoE, 13B Active Params, 1M Context for Low-Cost Agentic Inference (2026)

DeepSeek V4-Flash is the smaller variant of the DeepSeek V4 family: 284B total parameters, 13B active per forward pass, and the same 1M-token context via a hybrid sparse attention mechanism. The DeepSeek V4 deployment guide covers the full-scale V4-Pro variant (approximately 1.6T total parameters, 49B active), which requires far more VRAM. This guide focuses on Flash's reduced hardware footprint and what that means for inference cost in agentic pipelines.

The hybrid Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) mechanism that enables 1M-token context is shared between both variants. If you want to understand how the two-stage sparse attention works before sizing hardware, the DeepSeek Sparse Attention guide covers the full mechanism including FLOPs math and vLLM configuration.

What Is DeepSeek V4-Flash

V4-Flash is a sparse Mixture-of-Experts model. The 284B total parameters represent all expert weights that must reside in VRAM. Per forward pass, the router selects the top 6 of 256 routed experts plus 1 shared expert, activating ~13B parameters' worth of compute per token.

| Spec | DeepSeek V4-Flash | DeepSeek V4-Pro |
|---|---|---|
| Total parameters | 284B | ~1.6T |
| Active per forward pass | 13B | 49B |
| Expert count | 256 routed + 1 shared | 384 routed |
| Expert routing | Top-6 | Top-6 |
| Context window | 1M tokens | 1M tokens |
| FP8 weight VRAM | ~284 GB | ~1,600 GB |
| Minimum GPU config | 4x H100 SXM5 | 12+ H200 SXM5 |
| Intended use | Cost-sensitive agentic, high throughput | Complex coding, peak accuracy |

The 13B active parameter count is the number that controls compute per token, not storage requirements. All 284B weights must reside in GPU VRAM at all times because the router can select any expert on any token. This is the most common source of confusion for teams new to MoE serving: "active" means compute-active, not memory-optional. For a thorough explanation of this distinction, see the GPU memory requirements guide for LLMs.
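A toy routing sketch can make the compute-versus-memory distinction concrete. It assumes a hypothetical router that produces one score per routed expert and keeps the six highest; the random scores and the function names are illustrative stand-ins, not DeepSeek's implementation.

```python
import heapq
import random

NUM_ROUTED_EXPERTS = 256  # routed experts, from the spec table
TOP_K = 6                 # experts activated per token

def route_token(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def experts_touched(num_tokens, seed=0):
    """Count distinct experts selected across a batch, using random router scores."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]
        touched.update(route_token(scores))
    return len(touched)

one_token = route_token([random.random() for _ in range(NUM_ROUTED_EXPERTS)])
print(f"experts used by one token: {len(one_token)}")
print(f"distinct experts used by 1,000 tokens: {experts_touched(1000)}")
```

Each token only computes through six routed experts, but a batch of any realistic size reaches essentially every expert, which is why none of the weights can be left out of VRAM.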

Compared to V4-Pro: same context capability, same attention architecture, a small fraction of the VRAM requirement, and 2-3x higher token throughput on identical hardware. The tradeoff is quality: V4-Flash scores below V4-Pro on complex multi-step coding and reasoning benchmarks, though the gap narrows for tasks that don't require deep chained reasoning.

For expert parallelism fundamentals and VRAM planning across MoE architectures, the MoE inference optimization guide is a good companion read before sizing your cluster.

Why 13B Active Parameters Change the GPU Footprint

The 13B active parameter count per token means Flash's compute throughput matches a 13B dense model, even though all 284B weights sit in VRAM. That asymmetry is Flash's core advantage.

VRAM requirements by precision:

| Precision | Weight VRAM | Formula |
|---|---|---|
| BF16 | ~568 GB | 284B × 2 bytes |
| FP8 | ~284 GB | 284B × 1 byte |
| AWQ INT4 | ~142 GB | 284B × 0.5 bytes |
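The same arithmetic as a short script, in case you want to rerun it for another parameter count. It counts weights only and uses decimal gigabytes (1B parameters at 1 byte each is about 1 GB); the function name is an illustrative choice.

```python
TOTAL_PARAMS_B = 284  # total parameters, in billions

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "AWQ INT4": 0.5}

def weight_vram_gb(params_billions, bytes_per_param):
    """Approximate weight footprint in GB; excludes KV cache and activations."""
    return params_billions * bytes_per_param

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_vram_gb(TOTAL_PARAMS_B, nbytes):.0f} GB")
```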

The official DeepSeek-V4-Flash release uses a mixed native format: FP4 for MoE experts and FP8 for dense layers. On-disk, the actual weight files may be smaller than the ~284 GB FP8 estimate in the table above. The ~284 GB figure is the conservative serving baseline for vLLM; if your framework loads the native FP4/FP8 format directly, the real VRAM footprint may be lower.

Compare this to V4-Pro: at FP8 (1 byte per parameter), a ~1.6T-parameter V4-Pro requires roughly 1,600 GB versus Flash's 284 GB. Flash's minimum FP8 config is 4x H100 SXM5 (320 GB), a far smaller entry point than V4-Pro, which needs 12+ H200 SXM5 for FP8 serving.

At BF16 precision, Flash's 568 GB weight footprint exceeds the combined 564 GB VRAM of 4x H200 SXM5, so the model cannot load on that configuration at all. 6x H200 SXM5 is the practical minimum for BF16 serving with any meaningful context. FP8 is the production choice, with negligible quality regression on standard benchmarks.

Hardware Requirements

| Configuration | VRAM | Quantization | Max Context | Notes |
|---|---|---|---|---|
| 4x H200 SXM5 | 564 GB | FP8 | 1M tokens | Recommended; ~280 GB for KV after weights |
| 4x H100 SXM5 | 320 GB | FP8 | 32K tokens | Budget FP8 option; ~36 GB for KV after weights |
| 2x H200 SXM5 | 282 GB | AWQ INT4 | 128K tokens | INT4 only; ~140 GB for KV |
| 2x H100 SXM5 | 160 GB | AWQ INT4 | 32K tokens | INT4 only; ~18 GB for KV after weights |

Note: 2x H200 (282 GB) and 2x H100 (160 GB) cannot hold FP8 weights (~284 GB). FP8 requires at least 4x H100 (320 GB). Single-GPU configurations cannot run V4-Flash at any standard precision: even AWQ INT4 weights are ~142 GB, exceeding a single H200 SXM5 (141 GB) or H100 (80 GB). A100 80G SXM4 is excluded from FP8 configs because Ampere lacks the FP8 Transformer Engine required by this serving path.

The 4x H100 row leaves only ~36 GB for KV cache after loading 284 GB of weights, enough for short sessions up to 32K tokens. For context lengths above 32K or multi-user serving, 4x H200 at $19.36/hr on-demand gives ~280 GB of KV headroom versus ~36 GB on 4x H100 at $10.16/hr.
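The headroom figures above are total VRAM minus weights. As a rough sketch, the script below reproduces them and also shows the effect of vLLM's `--gpu-memory-utilization`, which caps the fraction of each GPU the server may use. Activations and runtime buffers are ignored here, so treat the results as upper bounds; the helper name is hypothetical.

```python
GPU_VRAM_GB = {"H100 SXM5": 80, "H200 SXM5": 141}
WEIGHTS_GB = {"FP8": 284, "AWQ INT4": 142}

def kv_headroom_gb(gpu, count, precision, gpu_memory_utilization=1.0):
    """VRAM left for KV cache after weights; a negative value means the weights do not fit."""
    budget = GPU_VRAM_GB[gpu] * count * gpu_memory_utilization
    return budget - WEIGHTS_GB[precision]

CONFIGS = [
    ("H200 SXM5", 4, "FP8"),
    ("H100 SXM5", 4, "FP8"),
    ("H200 SXM5", 2, "AWQ INT4"),
    ("H100 SXM5", 2, "AWQ INT4"),
    ("H200 SXM5", 2, "FP8"),  # included to show a configuration that cannot load
]

for gpu, count, precision in CONFIGS:
    raw = kv_headroom_gb(gpu, count, precision)
    capped = kv_headroom_gb(gpu, count, precision, 0.90)
    print(f"{count}x {gpu} {precision}: raw {raw:+.0f} GB, at 0.90 utilization {capped:+.1f} GB")
```

On 4x H100 the 0.90 cap leaves only a few GB for KV against a 284 GB weight estimate, so that configuration may need a higher utilization setting, or the smaller native FP4/FP8 footprint, to start at all.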

For H100 vs H200 comparison at this scale, see H100 vs H200 for LLM inference.

Serving with vLLM

Config 1: 4x H100 FP8 (Budget, Short Context)

bash
#!/bin/bash
# Install vLLM
pip install "vllm>=0.19.0"

# Verify CUDA
nvidia-smi

# Download V4-Flash weights (~284 GB FP8)
# Note: verify the HuggingFace repo path before downloading; slugs can differ
# from announced model names. Check https://huggingface.co/deepseek-ai for the
# canonical repository name.
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

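# Launch vLLM server.
# FP8 is requested with --quantization; --dtype only accepts
# auto, half, float16, bfloat16, float, or float32.
# Serving by repo ID loads from the Hugging Face cache. To reuse the --local-dir
# download above, pass ./deepseek-v4-flash as the model argument and add
# --served-model-name deepseek-ai/DeepSeek-V4-Flash.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \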
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000

--max-model-len 32768 is mandatory on 4x H100. With only ~36 GB of KV headroom after weights, going higher will OOM. --enable-chunked-prefill prevents long prefill passes from stalling decode for concurrent requests.

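Config 2: 4x H200 FP8 (128K Context)

bash
# FP8 is requested with --quantization; --dtype does not accept fp8.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \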
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --host 0.0.0.0 \
  --port 8000

On 4x H200, the ~280 GB of KV headroom lets you run 128K-token sessions comfortably. --max-num-seqs 64 is a reasonable starting point; tune up if GPU utilization stays below 70% under your target load.

Config 3: 4x H200 FP8 with Sparse Attention (1M Context)

bash
# FP8 is requested with --quantization; --dtype does not accept fp8.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 16 \
  --host 0.0.0.0 \
  --port 8000

V4-Flash uses a hybrid CSA/HCA architecture that keeps active KV compute bounded at long contexts. Before running 1M-token workloads, check the vLLM changelog for the V4-specific attention backend flag, which differs from the V3-series flags (VLLM_ATTENTION_BACKEND=dsa was the V3.2-Exp flag and has not been confirmed for V4). Without the correct backend, dense attention would consume hundreds of GB of KV per concurrent user, making 1M-token inference infeasible even on 4x H200.

Test your endpoint:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Summarize the following document in three sentences: [your text]"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
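Loading roughly 284 GB of weights takes a while, so a client that fires immediately after launch will usually get a connection error. A minimal readiness poll, assuming the server exposes the OpenAI-compatible `/v1/models` listing; the helper names, retry count, and delay are illustrative choices to tune for your environment.

```python
import json
import time
import urllib.error
import urllib.request

def http_get(url, timeout):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def served_models(base_url, fetch=http_get, timeout=5.0):
    """List model IDs reported by the /v1/models endpoint."""
    payload = json.loads(fetch(f"{base_url}/v1/models", timeout))
    return [entry["id"] for entry in payload.get("data", [])]

def wait_until_ready(base_url, model, retries=90, delay=20.0,
                     fetch=http_get, sleep=time.sleep):
    """Poll until the model is listed; return False if it never appears."""
    for _ in range(retries):
        try:
            if model in served_models(base_url, fetch):
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections yet, or an incomplete response
        sleep(delay)
    return False

# Example call once the server has been launched:
# wait_until_ready("http://localhost:8000", "deepseek-ai/DeepSeek-V4-Flash")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.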

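Alternative: SGLang. SGLang can serve V4-Flash with a comparable launch command: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V4-Flash --tp 4 --port 30000, plus the expert-parallelism and FP8 quantization options for your SGLang release. Those option names have changed between versions, so confirm them against the SGLang server arguments reference rather than copying vLLM's flags. For a full production SGLang setup including health checks and load balancing, see the SGLang production deployment guide.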

KTransformers supports V4-Flash for CPU/GPU offloaded serving on consumer hardware, useful if you want to experiment without a datacenter cluster. However, verified CLI flags for V4-Flash specifically are not confirmed in current documentation, so this guide does not provide a KTransformers setup. Check the KTransformers repo directly if CPU offloading is your path.

Tuning for 1M-Token Context Without Blowing Up KV Cache

The 1M-token capability on Flash depends on the CSA/HCA hybrid attention mechanism. The architecture selects a fixed set of compressed KV entries per query pass (top-1,024 compressed entries under the CSA design), so active KV compute stays bounded regardless of context length. For a detailed breakdown of the FLOPs math and vLLM configuration, see the sparse attention guide linked at the top of this post.
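To illustrate why a fixed top-k keeps per-query work bounded, here is a deliberately simplified sketch. It assumes each compressed KV entry already has a relevance score for the current query and ignores how CSA/HCA actually compresses and scores entries; only the top-1,024 figure comes from the text above, and the function names are hypothetical.

```python
import heapq

TOP_K = 1024  # compressed entries kept per query, per the CSA description above

def select_entries(relevance_scores, top_k=TOP_K):
    """Indices of the highest-scoring compressed KV entries for one query."""
    k = min(top_k, len(relevance_scores))
    return heapq.nlargest(k, range(len(relevance_scores)), key=relevance_scores.__getitem__)

def dense_vs_sparse_reads(num_entries, top_k=TOP_K):
    """Entries one query reads under dense attention versus top-k selection."""
    return num_entries, min(num_entries, top_k)

for n in (512, 32_768, 1_048_576):
    dense, sparse = dense_vs_sparse_reads(n)
    print(f"{n:>9} entries: dense reads {dense}, top-k reads {sparse}")
```

The stored KV history still grows with context length; only the portion each query attends to is capped, which is why KV storage remains the limiting factor below.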

On 4x H200 FP8, after loading ~284 GB of weights, roughly 280 GB remains for KV storage. With FP8 KV (~160 GB per 1M-token session), that holds the full stored KV history for one full-length 1M-token session, or two when neither runs to the full limit.

KV memory vs context length on 4x H200 FP8 (V4-Flash):

| --max-model-len | Stored KV per Session (BF16) | Stored KV per Session (FP8) | Concurrent Sessions |
|---|---|---|---|
| 32K | ~10 GB | ~5 GB | 56 (FP8 KV) |
| 128K | ~40 GB | ~20 GB | 14 (FP8 KV) |
| 512K | ~160 GB | ~80 GB | 3-4 (FP8 KV) |
| 1M | ~320 GB | ~160 GB | 1-2 (FP8 KV) |

Setting --kv-cache-dtype fp8_e5m2 halves KV storage (FP8 KV cache), which is the single most impactful tuning step for 1M-token workloads. At 1M context with FP8 KV, one to two concurrent sessions fit within the 280 GB KV budget on 4x H200.

Rule: set --max-model-len to your actual workload maximum, not the theoretical 1M limit. If your users never send inputs over 128K tokens, running --max-model-len 131072 gives you 14 concurrent sessions on 4x H200 instead of 1-2.
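A small calculator for this rule, assuming the table's figures scale linearly with context length (about 10 GB of BF16 KV per 32K tokens, halved for FP8) and the ~280 GB KV budget on 4x H200. Both constants are taken from this guide rather than measured, and the function names are illustrative.

```python
import math

KV_BUDGET_GB = 280       # 4x H200 FP8, after ~284 GB of weights
BF16_KV_GB_PER_32K = 10  # per-session KV at 32K context, from the table
KV_DTYPE_SCALE = {"bf16": 1.0, "fp8": 0.5}

def kv_per_session_gb(max_model_len, kv_dtype="fp8"):
    """Stored KV for one session that fills the whole context window."""
    return BF16_KV_GB_PER_32K * (max_model_len / 32_768) * KV_DTYPE_SCALE[kv_dtype]

def full_length_sessions(max_model_len, kv_dtype="fp8", budget_gb=KV_BUDGET_GB):
    """Whole sessions that fit when every session uses the full window."""
    return math.floor(budget_gb / kv_per_session_gb(max_model_len, kv_dtype))

for length in (32_768, 131_072, 524_288, 1_048_576):
    per_session = kv_per_session_gb(length)
    print(f"{length:>9} tokens: ~{per_session:.0f} GB per session, "
          f"{full_length_sessions(length)} full-length sessions")
```

Rounding down gives 3 sessions at 512K and 1 at 1M. The upper ends of the table's 3-4 and 1-2 ranges apply only when some sessions stay below the configured maximum.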

For PagedAttention details and KV budget planning, see the KV cache optimization guide.

Spheron GPU Pricing for DeepSeek V4-Flash

Pricing is based on live rates fetched on 15 Jun 2026. On-demand rates come from dedicated offers; spot rates come from spot offers.

| Configuration | On-Demand | Spot |
|---|---|---|
| 2x H100 SXM5 | $5.08/hr | $2.98/hr |
| 4x H100 SXM5 | $10.16/hr | $5.96/hr |
| 2x H200 SXM5 | $9.68/hr | $3.56/hr |
| 4x H200 SXM5 | $19.36/hr | $7.12/hr |

On hardware that can run both models, V4-Flash delivers 2-3x higher token throughput than V4-Pro, because the 13B active parameter compute is lighter per token. V4-Pro does not fit on 4x H200 at FP8, so the $19.36/hr on-demand entry point applies to Flash alone. If throughput-per-dollar is your metric, Flash wins on any hardware config where both models can run.

Cost-per-million-token math (4x H200 FP8 spot, 1,000 tokens/sec):

$7.12/hr / 3,600 sec/hr = $0.001978/sec

$0.001978/sec / 1,000 tokens/sec = $0.000001978/token

$0.000001978 × 1,000,000 = $1.98/M tokens

At 1,000 tokens/sec on 4x H200 spot, V4-Flash runs at roughly $1.98/M tokens. Actual throughput depends on batch size and context length; short requests batch efficiently and can push throughput higher.
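The same calculation as a function, so you can substitute your own measured throughput. The hourly rates come from the pricing table; the 500 and 2,000 tokens/sec rows are hypothetical throughput levels for comparison, not benchmarks.

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

RATES = {
    "4x H200 spot": 7.12,
    "4x H200 on-demand": 19.36,
    "4x H100 spot": 5.96,
    "4x H100 on-demand": 10.16,
}

for label, rate in RATES.items():
    for tps in (500, 1000, 2000):
        print(f"{label} @ {tps} tok/s: ${cost_per_million_tokens(rate, tps):.2f}/M tokens")
```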

For batch agentic jobs (offline document processing, code review pipelines, data extraction), spot is viable. Interruptions are recoverable with short-interval checkpointing. For real-time serving with latency SLAs, use on-demand.

Pricing fluctuates based on GPU availability. The prices above are based on 15 Jun 2026 and may have changed. Check current GPU pricing for live rates.

V4-Flash vs V4-Pro vs Other Open-Weight Agentic Models

| Model | Total Params | Active Params | Context | Minimum GPU Config | Best For |
|---|---|---|---|---|---|
| DeepSeek V4-Flash | 284B | 13B | 1M | 4x H100 FP8 | Cost-sensitive agentic, high throughput |
| DeepSeek V4-Pro | ~1.6T | 49B | 1M | 12+ H200 FP8 | Complex coding, best accuracy |
| MiniMax M3 | 229.9B | 9.8B | 1M | 2x H200 FP8 | Multimodal + coding |
| Kimi K2.7 Code | 1T | 32B | 256K | 8x H200 FP8 | Coding-first agentic |

Pick Flash when: your workload involves many short-to-medium calls per session (agentic loops, document chunking, tool-call pipelines), where per-call cost matters more than quality on any single call. The 13B active parameter budget is enough for instruction-following, JSON extraction, summarization, and lightweight code generation.

Pick V4-Pro when: you need peak accuracy on complex multi-step tasks: competitive programming, long-horizon planning, multi-document synthesis with citation accuracy. The 49B active parameters give V4-Pro a measurable edge on benchmarks like SWE-bench and LiveCodeBench, at dramatically higher hardware cost.

Pick MiniMax M3 for multimodal agentic tasks that mix vision and text within the same pipeline. See the MiniMax M3 deployment guide for setup.

Pick Kimi K2.7 Code for pure coding workloads where coding benchmark scores (HumanEval, SWE-bench) are the primary metric. See the Kimi K2.7 Code deployment guide for setup on 8x H200.

When to Use Spot vs On-Demand for Flash Agentic Pipelines

Flash's low per-request cost changes the spot/on-demand calculus compared to heavier models. A V4-Pro spot interruption during a 5-minute generation run loses significant work. A V4-Flash spot interruption during a 30-second agentic call loses very little. That asymmetry makes spot viable for Flash agentic workloads where it would not be viable for V4-Pro.

Practical pattern: run the agent controller (orchestration logic, tool calls, state management) on a small on-demand CPU or GPU instance. Run V4-Flash inference workers on spot. When a spot interruption hits an inference worker, the controller retries the call on the next available worker. The only cost is the retry latency, not lost intermediate state.
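One way to sketch the controller's retry logic. It assumes each worker is a callable that raises a hypothetical `WorkerInterrupted` exception when its spot instance disappears; in practice the callable would wrap an OpenAI-client request and translate connection errors into that exception.

```python
import itertools

class WorkerInterrupted(Exception):
    """Raised by a worker call when its spot instance has been reclaimed (hypothetical)."""

def call_with_failover(workers, prompt, max_attempts=3):
    """Try workers round-robin until one answers; fail after max_attempts tries."""
    last_error = None
    for _, worker in zip(range(max_attempts), itertools.cycle(workers)):
        try:
            return worker(prompt)
        except WorkerInterrupted as err:
            last_error = err  # move on to the next worker
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error

# Stand-in workers for demonstration only.
def flaky_worker(prompt):
    raise WorkerInterrupted("spot instance reclaimed")

def healthy_worker(prompt):
    return f"echo: {prompt}"

print(call_with_failover([flaky_worker, healthy_worker], "extract the invoice total"))
```

Because the controller holds the conversation state, a retried call simply resends the same prompt to another worker.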

For checkpoint and recovery patterns on spot GPU instances, the spot GPU training resilience guide covers the checkpoint interval math that keeps recovery overhead below 5% of total compute cost.


DeepSeek V4-Flash's 284 GB FP8 footprint means you can serve a frontier MoE on 4x H100 SXM5 on Spheron at $10.16/hr on-demand, vs 12+ H200s for V4-Pro. Spot instances lower that cost further for agentic batch workloads.


Quick Setup Guide

  1. Calculate VRAM and choose quantization

    DeepSeek V4-Flash at FP8 (1 byte per parameter) needs roughly 284 GB for weights. The minimum FP8 config is 4x H100 SXM5 (320 GB total, ~36 GB KV headroom) for short context up to 32K tokens. For 1M-token context, use 4x H200 SXM5 (564 GB total, ~280 GB KV). AWQ INT4 (~142 GB) requires at least 2x H100 SXM5 (160 GB total, ~18 GB KV) for limited context testing.

  2. Provision a GPU instance on Spheron

    Log into app.spheron.ai and select a 4x H100 SXM5 or 4x H200 SXM5 offer. Set storage to 400 GB minimum (weights are ~284 GB). Choose Ubuntu 22.04 or 24.04. For cost-sensitive agentic workloads, select spot pricing. SSH setup guide: https://docs.spheron.ai/connecting/ssh-connection.

  3. Install vLLM and download V4-Flash weights

    Install vLLM: pip install "vllm>=0.19.0". Verify CUDA 12.4+ with nvidia-smi. Download weights: huggingface-cli download deepseek-ai/DeepSeek-V4-Flash --local-dir ./deepseek-v4-flash. The FP8 weights are roughly 284 GB.

  4. Launch the vLLM inference server

    Run: vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 4 --enable-expert-parallel --quantization fp8 --kv-cache-dtype fp8_e5m2 --max-model-len 131072 --gpu-memory-utilization 0.90 --enable-chunked-prefill --host 0.0.0.0 --port 8000. FP8 is requested through --quantization because --dtype accepts only auto, half, float16, bfloat16, float, and float32. On 4x H100, lower --max-model-len to 32768. For 1M context on 4x H200, increase --max-model-len to 1048576 and verify the correct attention backend flag for V4 in the vLLM changelog.

  5. Enable sparse attention for 1M-token context

    V4-Flash uses a hybrid Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) architecture for 1M-token context. Check the vLLM changelog for the V4-specific attention backend flag before launching, as the flags differ from V3-series models. Without the correct backend, a dense KV cache would grow linearly with context to hundreds of GB per session, and attention compute would grow quadratically, making 1M-token inference infeasible even on 4x H200.

  6. Test and monitor

    Send a test request to http://localhost:8000/v1/chat/completions using the OpenAI client. Monitor GPU utilization with nvidia-smi dmon -s u -d 5. If utilization stays below 70% under sustained load, increase --max-num-seqs.
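For scripted monitoring, the 70% rule can be checked with a short helper. This sketch assumes `nvidia-smi` is on the PATH and uses its CSV query mode rather than `dmon`; the 70% threshold comes from the step above, and the function names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]

def parse_utilization(csv_text):
    """Parse one utilization percentage per line into a list of ints."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def mean_utilization(csv_text):
    """Average utilization across all GPUs in the sample."""
    values = parse_utilization(csv_text)
    return sum(values) / len(values)

def should_raise_max_num_seqs(csv_text, threshold=70.0):
    """True when average utilization is below the threshold under load."""
    return mean_utilization(csv_text) < threshold

def sample_gpus():
    """Run nvidia-smi once and return its raw CSV output."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout

# Canned sample for a 4-GPU node; on a live host, pass sample_gpus() instead.
sample = "55\n61\n58\n49\n"
print(mean_utilization(sample), should_raise_max_num_seqs(sample))
```

A single sample is noisy, so average several readings taken under your target load before changing --max-num-seqs.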

Frequently Asked Questions

What hardware does DeepSeek V4-Flash need?

DeepSeek V4-Flash has 284B total parameters with 13B active per token. At FP8 (1 byte per parameter), the weights are roughly 284 GB, fitting on 4x H100 SXM5 (320 GB total, ~36 GB left for KV) or 4x H200 SXM5 (564 GB total, ~280 GB for KV). BF16 needs roughly 568 GB for weights alone, which exceeds 4x H200 (564 GB), so the model cannot load there; 6x H200 is the practical BF16 minimum. For full 1M-token context, use 4x H200 FP8 to leave KV cache headroom.

How does V4-Flash differ from V4-Pro?

V4-Flash is 284B total parameters with 13B active per forward pass, routed through 256 experts. V4-Pro is approximately 1.6T total parameters with 49B active per forward pass, routed through 384 experts. V4-Flash needs a fraction of V4-Pro's VRAM and produces 2-3x more tokens per second on the same hardware. V4-Pro scores higher on complex coding and multi-step reasoning benchmarks; Flash is the right choice when inference cost and throughput per dollar matter more than peak quality.

Can V4-Flash run on a single GPU with INT4 quantization?

No. AWQ INT4 reduces the 284B weights from ~284 GB (FP8) to ~142 GB (INT4), but a single H100 SXM5 (80 GB) cannot hold 142 GB of weights. Even a single H200 SXM5 (141 GB) falls just short. The minimum for INT4 is 2x H100 SXM5 (160 GB total, ~18 GB KV headroom). For FP8 serving, 4x H100 SXM5 (320 GB total, ~36 GB KV) is the minimum. For production serving with concurrent users or long context, 4x H200 FP8 is the practical recommendation.

Can V4-Flash serve the full 1M-token context?

Yes. V4-Flash uses a hybrid sparse attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) that keeps active KV compute bounded regardless of context length, selecting a fixed set of compressed KV entries per query pass. On 4x H200 FP8, after loading ~284 GB of weights, roughly 280 GB remains for KV cache. With FP8 KV cache (~160 GB per 1M-token session), 1-2 concurrent 1M-token sessions fit within the available KV budget.

Is V4-Flash a good fit for agentic workloads?

Yes. The 13B active parameter count gives V4-Flash per-token compute similar to a 13B dense model, which means the same on-demand GPU cost supports 2-3x more agentic steps per dollar than V4-Pro. For long-running agent loops where the model makes many small calls, Flash's lower per-call cost typically outweighs V4-Pro's quality advantage on any single call.
