DeepSeek V3.2 Speciale is among the most capable open-source reasoning models available today. It delivers top-tier performance on math benchmarks, achieved gold-medal level performance in the 2025 International Mathematical Olympiad, and ranks competitively against frontier proprietary models on reasoning tasks. The model features DeepSeek Sparse Attention (DSA), a mechanism that dramatically reduces compute for long-context inputs.
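To see why sparse attention matters at long context, here's a rough cost comparison. This is an illustrative sketch, assuming DSA attends to a bounded top-k set of prior tokens per query (k = 2048 here, an assumed value) versus dense attention over the full prefix:

```python
def attention_token_pairs(context_len, top_k=None):
    """Approximate query-key pairs scored over a full context.

    Dense attention scores every prior token for every query; a top-k
    sparse scheme only attends to a bounded set per query.
    """
    if top_k is None:
        # dense: sum_{i=1..L} i = L * (L + 1) / 2
        return context_len * (context_len + 1) // 2
    # sparse: each query attends to at most top_k tokens
    return sum(min(i, top_k) for i in range(1, context_len + 1))

L, K = 131_072, 2_048  # 128K context, illustrative top-k
dense = attention_token_pairs(L)
sparse = attention_token_pairs(L, top_k=K)
print(f"attention work reduction at 128K context: ~{dense / sparse:.0f}x")
```

The ratio grows roughly linearly with context length, which is why the savings only become dramatic at long context.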
The catch is hardware requirements. This is a 685B parameter Mixture-of-Experts model. You can't run it on a single GPU, and you need to get the configuration right to achieve reasonable throughput. Most deployment guides either skip the practical details or assume you have unlimited budget.
This guide gives you the exact hardware specs, vLLM configuration, and deployment steps, from the minimum setup that actually works to the production configuration that handles real traffic. Check out our GPU memory requirements guide for detailed VRAM planning across different models, and our 2026 GPU requirements cheat sheet for quick reference comparisons. For a detailed look at how MoE inference works across all major models, see our MoE inference optimization guide.
Model Specifications
| Spec | DeepSeek V3.2 Speciale |
|---|---|
| Total Parameters | 685B |
| Active Parameters | 37B per token |
| Architecture | Mixture of Experts (MoE) |
| Context Window | 128K tokens |
| FP16 Model Size | ~1.4 TB |
| FP8 Model Size | ~690 GB |
| Key Feature | DeepSeek Sparse Attention (DSA) |
| Supported Hardware | Hopper (H100/H200) and Blackwell (B200/B300) only |
That last line is important: only Hopper and Blackwell datacenter GPUs are supported. Consumer GPUs (RTX 4090, RTX 5090) and older datacenter GPUs (A100) won't work with the official model weights. The core constraint for consumer GPUs is memory capacity: the RTX 5090 has 32GB and the RTX 4090 has 24GB, versus the ~690GB needed for the FP8 weights alone. A100s lack Hopper's Transformer Engine required for the mixed-precision FP8 kernel implementations used in V3.2 Speciale. See our analysis of best NVIDIA GPUs for LLMs to understand why Hopper and Blackwell datacenter GPUs dominate large model deployments. If you're choosing between this model and Llama 4 or Qwen 3, see our DeepSeek vs Llama 4 vs Qwen 3 comparison for a cost and benchmark breakdown.
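The table's size figures follow directly from the parameter count; a quick sanity check (1 byte per parameter at FP8, 2 at FP16, ignoring small overheads):

```python
PARAMS = 685e9  # total parameters, per the spec table

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes/param -> ~1370 GB (~1.4 TB)
fp8_gb = PARAMS * 1 / 1e9   # 1 byte/param  -> ~685 GB (~690 GB with overhead)

# Weights vs. single-GPU capacities (FP8):
for name, vram_gb in [("RTX 4090", 24), ("RTX 5090", 32), ("H100", 80), ("H200", 141)]:
    print(f"{name}: weights are ~{fp8_gb / vram_gb:.0f}x its {vram_gb} GB of VRAM")
```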
GPU Hardware Requirements
Minimum Viable: 8x H100 80GB (640 GB total)
This is the floor for running V3.2 Speciale, and it's tight: the FP8 weights (~690 GB) slightly exceed the 640 GB of aggregate VRAM, so an 8x H100 node needs either a further-compressed checkpoint (e.g., a community ~4-bit weight quantization) or partial offloading of expert weights to host memory. Either way, headroom for KV cache is limited, which means shorter effective context lengths (32K-64K) and smaller batch sizes.
Configuration: tensor-parallel-size 8, quantized weights, max context ~64K tokens.
Recommended: 8x H200 141GB (1.13 TB total)
The H200's 141 GB per GPU gives you 1.13 TB total, enough for the FP8 model weights plus generous KV cache for full 128K context. This is what you want for production inference with long-context workloads. The H200 GPU rental option provides this exact configuration as a managed service.
Configuration: tensor-parallel-size 8, FP8 precision, max context 128K tokens.
Optimal: 8x B300 288GB (2.3 TB total)
On Blackwell hardware, V3.2 Speciale runs with maximum headroom. The B300's 288 GB per GPU gives you 2.3 TB total, enough for FP16 weights with room to spare, or FP8 weights with massive KV cache for high-concurrency serving.
Configuration: tensor-parallel-size 8, FP8 or FP16 precision, max context 128K tokens with high batch sizes.
What Won't Work
- 4x H100 80GB (320 GB): Not enough VRAM for the full model even in FP8. The model weights alone require ~690 GB.
- Any number of A100s: Hopper-class hardware is required. The model's FP8 operations are incompatible with A100 tensor cores.
- Consumer GPUs: RTX 4090 (24GB) and RTX 5090 (32GB) lack the memory capacity required. The FP8 model weights alone need ~690GB, far exceeding what any single consumer GPU can provide.
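The entries above reduce to the same arithmetic; a small helper makes it explicit (illustrative only: it checks whether the FP8 weights alone fit in aggregate VRAM, ignoring KV cache, activation overhead, and any further quantization or offloading):

```python
FP8_WEIGHTS_GB = 690  # approximate FP8 checkpoint size

def fits(gpu_count, vram_per_gpu_gb, weights_gb=FP8_WEIGHTS_GB):
    """True if the FP8 weights alone fit in aggregate VRAM (no KV cache counted)."""
    return gpu_count * vram_per_gpu_gb >= weights_gb

configs = {
    "4x H100 80GB": (4, 80),    # 320 GB total
    "8x H200 141GB": (8, 141),  # 1128 GB total
    "8x B300 288GB": (8, 288),  # 2304 GB total
    "1x RTX 5090": (1, 32),     # 32 GB total
}
for name, (n, vram) in configs.items():
    print(name, "fits" if fits(n, vram) else "does not fit")
```

Memory capacity is only the first gate; as noted above, pre-Hopper GPUs also lack the FP8 support the model's kernels require, so passing this check is necessary but not sufficient.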
Step-by-Step Deployment with vLLM
Prerequisites
Provision a GPU server with 8x H100, H200, or B300 GPUs. On Spheron, you can get H100 rental as Spot instances starting at $1.49/hr per GPU, H200 Dedicated instances, or B300 GPUs starting at $2.90/hr Spot.
SSH into your server and verify the GPU setup:
```shell
nvidia-smi
# Verify 8 GPUs visible, CUDA 12.x, driver 535+
```

Install Dependencies
```shell
pip install vllm --upgrade

# DeepGEMM is required for MoE computation
pip install git+https://github.com/deepseek-ai/DeepGEMM
```

DeepGEMM provides optimized kernels for the MoE layers. Without it, inference falls back to slower generic implementations.
Download Model Weights
The model weights are approximately 690 GB (FP8). Download from Hugging Face:
```shell
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Speciale \
  --local-dir /data/models/deepseek-v3.2-speciale
```

This takes 30-60 minutes depending on your network bandwidth. Use a persistent storage volume so you don't re-download on instance restarts.
Launch vLLM Server
Standard deployment on 8x H100:
```shell
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 65536 \
  --port 8000
```

Production deployment with FP8 KV cache:
```shell
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --port 8000
```

Using FP8 KV cache halves the memory consumed by the attention cache, letting you either serve longer contexts or handle more concurrent requests.
Optimized deployment with Expert Parallelism:
For best throughput, DeepSeek recommends running with Expert Parallelism (EP) rather than pure Tensor Parallelism:
```shell
# EP mode: better throughput for MoE models
VLLM_USE_DEEP_GEMM=1 vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.9 \
  --max-model-len 131072 \
  --port 8000
```

Note: if you encounter stability issues with EP mode, fall back to tensor parallelism (TP=8), which is more robust though slightly slower.
Test the API
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2-Speciale",
    "messages": [
      {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
    "max_tokens": 1024,
    "temperature": 0.6
  }'
```

Cost Analysis
Running V3.2 Speciale full-time isn't cheap, but it's dramatically cheaper than API pricing for equivalent capability.
| Configuration | GPUs | Hourly Cost | Monthly (24/7) | Cost per 1M tokens |
|---|---|---|---|---|
| 8x H100 (Spot) | 8x H100 | ~$12.00/hr | ~$8,640 | ~$0.30 |
| 8x H100 (Dedicated) | 8x H100 | ~$16.00/hr | ~$11,520 | ~$0.40 |
| 8x H200 (Dedicated) | 8x H200 | ~$28.00/hr | ~$20,160 | ~$0.25 |
| 8x B300 (Spot) | 8x B300 | ~$23.20/hr | ~$16,704 | ~$0.10 |
Compare this to API pricing for equivalent-capability models: frontier proprietary APIs run $3-75 per million tokens depending on the model. The break-even point depends on which API you'd be replacing: against a $75-per-million-token frontier API, roughly 115M tokens per month covers the ~$8,640 H100 Spot bill; against a $3-per-million-token API, you need closer to 3B tokens per month before self-hosting wins.
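You can run the break-even for your own traffic and prices; a simple sketch using the illustrative figures from the table above:

```python
def breakeven_mtokens_per_month(monthly_gpu_cost_usd, api_price_per_mtok):
    """Monthly token volume (in millions) above which self-hosting beats the API."""
    return monthly_gpu_cost_usd / api_price_per_mtok

h100_spot_monthly = 8_640  # 8x H100 Spot, 24/7, from the table above
print(breakeven_mtokens_per_month(h100_spot_monthly, 75.0))  # vs a premium API
print(breakeven_mtokens_per_month(h100_spot_monthly, 3.0))   # vs a cheap API
```

Note this compares raw token prices only; it ignores engineering time, utilization gaps, and the fact that a self-hosted cluster's cost is fixed whether or not you saturate it.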
Performance Tuning
Context length vs throughput tradeoff. Every doubling of max context length halves your maximum concurrent batch size (roughly). If your application doesn't need 128K context, set --max-model-len lower and use the freed VRAM for higher concurrency.
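That tradeoff is just a budget split; here's a sketch assuming a fixed KV-cache budget and a fixed per-token cache footprint (both numbers below are placeholders, not the model's actual MLA cache size):

```python
def max_concurrent_seqs(kv_budget_gb, max_model_len, kb_per_token=70.0):
    """How many full-length sequences fit in a fixed KV-cache budget."""
    per_seq_gb = max_model_len * kb_per_token / 1e6  # KB -> GB
    return int(kv_budget_gb // per_seq_gb)

budget = 400  # GB left for KV cache after weights; illustrative
for ctx in (32_768, 65_536, 131_072):
    print(f"--max-model-len {ctx}: ~{max_concurrent_seqs(budget, ctx)} full-context seqs")
```

Whatever the real per-token footprint, the shape is the same: doubling `--max-model-len` roughly halves the number of full-context sequences the cache can hold.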
DeepGEMM tuning. Some users report better performance with VLLM_USE_DEEP_GEMM=0 on certain GPU configurations (particularly H20s). Benchmark with and without to find what works for your hardware.
Temperature for reasoning. DeepSeek recommends temperature 0.6 for reasoning tasks. Lower temperatures (0.1-0.3) produce more deterministic but sometimes less thorough reasoning chains.
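These sampling settings are set per request from any OpenAI-compatible client; a minimal stdlib sketch against the vLLM endpoint started earlier (the URL and model name assume the default setup above):

```python
import json
import urllib.request

def build_chat_request(prompt, temperature=0.6, max_tokens=1024):
    """Build an OpenAI-compatible chat payload for the vLLM server."""
    return {
        "model": "deepseek-ai/DeepSeek-V3.2-Speciale",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload and return the first choice's text (requires a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

payload = build_chat_request("Prove that the square root of 2 is irrational.")
# send(payload)  # uncomment once the vLLM server from above is up
```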
Persistent storage matters. The model is ~690 GB. Re-downloading on every instance restart wastes time and bandwidth. Always use persistent network-attached storage, and pre-download the weights before switching to Spot instances to avoid interruption during the download.
When to Use V3.2 Speciale vs Other Models
Use V3.2 Speciale when: your workload is math-heavy reasoning, complex code generation, scientific analysis, or any task where chain-of-thought reasoning quality matters more than speed. It's the best open-source model for tasks that require multi-step logical reasoning.
Use Llama 4 Scout instead when: you need ultra-long context (10M tokens versus 128K), lower deployment cost (single GPU versus 8 GPUs), or faster inference speed on simpler tasks.
Use Llama 4 Maverick instead when: you need strong general-purpose performance with better cost efficiency than V3.2 Speciale (Maverick uses 400B total but only 17B active, vs Speciale's 685B total / 37B active).
The open-source model landscape in 2026 gives you real choices. V3.2 Speciale is the reasoning champion. Llama 4 Scout is the context-length and efficiency champion. Pick based on your workload, not hype.
If you're evaluating the newer model, see the DeepSeek V4 deployment guide for the updated 1T-parameter architecture and expert parallelism setup.
