Dynamic Expert Quantization on GPU Cloud: Run Giant MoE Models on Fewer GPUs with Runtime Expert Offloading (2026 Guide)

The assumption behind every MoE deployment guide from the last two years is that all expert weights must live in VRAM simultaneously. You need 4x or 8x H100s for Qwen3-235B not because you are always computing on all those experts, but because the router dispatches to any of them at inference time, so all weights must be present. Dynamic expert quantization breaks that assumption. By assigning precision based on how often each expert actually gets selected, you can cut effective VRAM usage by 30-50% without retraining the model.

This post covers how router-trace hotness profiling works, the mechanics of per-expert mixed-precision quantization, what tools actually support it today, and the cost math comparing a single-GPU DynaExq setup against a multi-GPU expert-parallel baseline. For the fundamentals of MoE memory planning, expert parallelism, and VRAM budgeting, start with the MoE inference optimization guide first.

The MoE Memory Wall

MoE models require all expert weights resident in VRAM for a simple reason: the routing decision happens at inference time. The router sees the input token representation, selects the top-K experts for that token, and only then dispatches it. Since you cannot know which experts will be selected before running the forward pass, every expert's weights must already be loaded.

For Qwen3-235B-A22B, this means holding 235B parameters in memory. At FP8 (1 byte per parameter), that is approximately 235 GB of raw weight storage plus 15% framework overhead, putting you at roughly 270 GB. On H100 SXM5 with 80 GB HBM each, you need at minimum 4 GPUs just to load the weights. Add production-level KV cache requirements and you are looking at 4-8 GPUs depending on context length and batch concurrency.

A simpler framing: at BF16 you need 470 GB for Qwen3-235B weights alone. That requires 6x H100 80GB before any KV cache budget. Even compressed to FP8, you need 3-4x H100. The model spends most inference time activating only 22B of those 235B parameters (the top-8 of 128 experts per layer), but the inactive 213B must still sit in VRAM waiting to be dispatched.
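The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes the same figures used in this section (235B parameters, 15% framework overhead, 80 GB per H100); the function names are illustrative, not part of any library.

```python
import math

def weight_footprint_gb(params_billion, bytes_per_param, overhead=0.15):
    """Return (raw_gb, with_overhead_gb) for a weight set of the given size."""
    raw_gb = params_billion * bytes_per_param  # 1e9 params * bytes, expressed in GB
    return raw_gb, raw_gb * (1.0 + overhead)

def min_gpus(required_gb, gpu_memory_gb=80):
    """Smallest GPU count whose combined memory covers required_gb."""
    return math.ceil(required_gb / gpu_memory_gb)

fp8_raw, fp8_total = weight_footprint_gb(235, 1.0)   # FP8: 1 byte per parameter
bf16_raw, _ = weight_footprint_gb(235, 2.0)          # BF16: 2 bytes per parameter

print(f"FP8:  {fp8_raw:.0f} GB raw, {fp8_total:.0f} GB with overhead, "
      f"{min_gpus(fp8_total)} x H100 minimum")
print(f"BF16: {bf16_raw:.0f} GB raw, {min_gpus(bf16_raw)} x H100 before KV cache")
```

The GPU counts are lower bounds for weights only; KV cache and activation memory come on top.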

This is the memory wall that uniform quantization partially addresses. INT4 halves weight storage compared to FP8. But it applies the same compression to every expert regardless of how often it gets used. A hot expert that handles 15% of all tokens gets the same lossy treatment as a cold expert that handles 0.3% of tokens. That is the inefficiency dynamic expert quantization is designed to fix. For KV cache memory math beyond weight storage, see the KV cache optimization guide.

Expert Hotness: What Router Traces Reveal

Hotness is a per-expert activation frequency score. You derive it by running a representative sample of your actual traffic through the model and recording which experts the router selects at each layer, for each token.

In a Qwen3-235B model with 128 experts per layer and top-8 routing, each forward pass selects 8 experts per layer per token. Across thousands of tokens, the routing distribution is not uniform. Real production traffic consistently activates a small subset of experts on the majority of tokens.

Approximate routing distributions from a 1,000-token sample (illustrative, based on reported patterns from MoE deployment studies):

Expert frequency bucket	Number of experts	Share of all activations
Very hot (top 20%, 26 experts)	26	~55-65% of activations
Warm (20-50%, 38 experts)	38	~25-30% of activations
Cold (50-100%, 64 experts)	64	~10-15% of activations

The top 20% of experts handle over half of all routing decisions. The bottom 50% handle roughly 10-15% combined. This imbalance is what makes per-expert precision differentiation practical: you can compress the cold half aggressively and accept the accuracy tradeoff, because those experts contribute little to the final output on typical inputs.

Collecting a hotness histogram in practice: run your calibration dataset through the model with a custom hook on the router logits. Record the argmax or top-K selections per layer across all tokens. After 500-1000 representative prompts, you have a stable frequency distribution per expert. A sliding window of recent requests (the last N=1000 forward passes, updated every M=100 steps) is enough for production inference where traffic patterns shift gradually.
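A sliding-window histogram like the one described above could be kept with a small tracker. This is a sketch under assumptions: the router hook that produces the selections is framework-specific and not shown, and the `HotnessTracker` class and its input format (a dict from layer index to selected expert ids) are hypothetical.

```python
from collections import Counter, deque

class HotnessTracker:
    """Sliding-window count of router selections per (layer, expert)."""

    def __init__(self, window=1000):
        self.window = window        # number of forward passes to keep (N)
        self.records = deque()      # one list of (layer, expert) keys per pass
        self.counts = Counter()     # (layer, expert) -> selections inside the window

    def record(self, selections):
        """selections: dict mapping layer index -> iterable of selected expert ids."""
        keys = [(layer, e) for layer, experts in selections.items() for e in experts]
        self.records.append(keys)
        self.counts.update(keys)
        if len(self.records) > self.window:
            # Evict the oldest pass so the histogram follows recent traffic.
            for key in self.records.popleft():
                self.counts[key] -= 1
                if self.counts[key] <= 0:
                    del self.counts[key]

    def frequencies(self, layer):
        """Share of this layer's selections that went to each expert."""
        layer_counts = {e: c for (lyr, e), c in self.counts.items() if lyr == layer}
        total = sum(layer_counts.values())
        return {e: c / total for e, c in layer_counts.items()} if total else {}
```

Calling `record()` once per forward pass and reading `frequencies()` every M steps gives the update cadence described above without rescanning the full history.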

The classification boundary is a hyperparameter. Top-30% as hot is a reasonable starting point. If VRAM is tight, push to top-20%. If accuracy is the priority, keep top-40% at high precision.
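The boundary can be treated as a single parameter applied to a ranked list. A minimal sketch, assuming per-expert frequencies are already available as a dict; `classify_experts` is an illustrative helper, not an existing API.

```python
def classify_experts(frequencies, hot_fraction=0.30):
    """Split expert ids into (hot, cold) sets by selection frequency.

    frequencies: dict expert_id -> selection share or raw count.
    hot_fraction: share of experts to keep at high precision.
    """
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    n_hot = round(len(ranked) * hot_fraction)
    return set(ranked[:n_hot]), set(ranked[n_hot:])

# Example with 128 experts whose frequency falls with expert id.
freqs = {i: 128 - i for i in range(128)}
hot, cold = classify_experts(freqs, hot_fraction=0.30)
print(len(hot), "hot experts,", len(cold), "cold experts")
```

Moving `hot_fraction` between 0.20 and 0.40 is the VRAM-versus-accuracy dial described above.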

Dynamic Expert Quantization: How DynaExq Works

The core idea is a per-expert precision state machine with three levels: hot (FP8), warm (INT4), and cold (INT2 or CPU-offloaded). Each expert has a running frequency counter. When an expert's counter crosses a threshold, it transitions to the next precision level.

The state transitions are asymmetric by design:

Cold to warm: triggered when selection frequency exceeds the cold threshold over the last N steps
Warm to hot: triggered when frequency consistently exceeds the hot threshold
Hot to warm or cold: triggered by sustained below-threshold selection frequency, with a hysteresis window to avoid thrashing
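The transitions listed above can be sketched as a small state machine. This is an illustration under assumptions, not the DynaExq implementation: the thresholds, the streak lengths, and the choice to demote one level at a time are all hypothetical, and the re-quantization work itself is not modeled.

```python
from dataclasses import dataclass

LEVELS = ["cold", "warm", "hot"]  # INT2 or offloaded, INT4, FP8

@dataclass
class ExpertState:
    level: str = "cold"
    above: int = 0  # consecutive windows above the next level's threshold
    below: int = 0  # consecutive windows below the current level's floor

def step(state, freq, cold_thr=0.005, hot_thr=0.02, promote_after=2, demote_after=5):
    """Advance one expert's precision level from its latest window frequency."""
    floor = {"cold": 0.0, "warm": cold_thr, "hot": hot_thr}[state.level]
    ceiling = {"cold": cold_thr, "warm": hot_thr, "hot": float("inf")}[state.level]
    state.above = state.above + 1 if freq > ceiling else 0
    state.below = state.below + 1 if freq < floor else 0

    idx = LEVELS.index(state.level)
    # Asymmetry: cold -> warm is immediate, warm -> hot needs a short streak,
    # and any demotion needs a longer streak (the hysteresis window).
    needed = 1 if state.level == "cold" else promote_after
    if state.above >= needed:
        state.level, state.above, state.below = LEVELS[idx + 1], 0, 0
    elif state.below >= demote_after:
        state.level, state.above, state.below = LEVELS[idx - 1], 0, 0
    return state.level
```

Keeping `demote_after` larger than `promote_after` is what prevents an expert that hovers near a threshold from being re-quantized on every window.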

The precision transition itself happens asynchronously: the CUDA stream continues serving requests with the current precision while the re-quantization kernel runs on a separate stream. When it completes, the expert's weight pointer atomically switches to the new precision buffer. There is no blocking synchronization step on the inference path.

In practice, no production-ready DynaExq package exists as a standalone pip install as of mid-2026. Implementations fall into two categories:

Static mixed precision with hotness-informed calibration (practical today): Tools like ExLlamaV2 (EXL2 format) support per-layer importance-based quantization. By calibrating with router-trace data where hot experts appear frequently, the quantization algorithm assigns higher bit rates to those layers automatically. This achieves the hot/cold precision split at quantization time rather than dynamically at runtime. The ExLlamaV2 setup walkthrough below uses this approach.

True dynamic runtime precision (research-grade): Patching vLLM's model_runner.py to intercept expert forward calls and apply per-expert dequantization scales is possible but requires maintaining a custom fork. A reference implementation approach: override the MoELayer.forward() method to dispatch through a precision-aware registry that looks up each expert's current quantization state. This is the pattern described in recent MoE quantization research papers. Running it in production requires careful testing for CUDA stream safety.

For the llama.cpp path: imatrix-guided GGUF quantization achieves per-layer mixed precision by using importance scores derived from calibration data. If your calibration set includes prompts that heavily activate certain experts, those expert layers receive more weight protection under Q4_K_M or Q5_K_M recipes. Use ./llama-quantize --imatrix importance.dat model.gguf model-mixed.gguf Q4_K_M (the binary is named ./quantize in older llama.cpp builds; options go before the file arguments) with an imatrix generated from your router-trace calibration prompts.

Expert Offloading vs Dynamic Precision: The Tradeoff

Two distinct approaches reduce VRAM requirements for large MoE models. They are often conflated but solve different bottlenecks.

Approach	VRAM reduction	Throughput impact	Accuracy impact	Hardware requirement
CPU DRAM offloading (cold experts)	30-50%	High (-20 to -40%)	Minimal	PCIe bandwidth (~63 GB/s)
INT4 uniform quantization	50%	Low (-5 to -10%)	Moderate (+0.2-0.4 ppl)	Any GPU with FP8/INT8
DynaExq (FP8 hot / INT4 cold)	35-45%	Low (-3 to -5%)	Low (+0.1-0.15 ppl)	FP8-capable GPU (Hopper+)
DynaExq + INT2 tail offload	50-60%	Moderate (-15 to -20%)	Low (+0.15-0.2 ppl)	PCIe bandwidth + FP8 GPU

CPU offloading is throughput-limited by PCIe bandwidth. At ~63 GB/s for PCIe 5.0, loading a cold expert from CPU DRAM takes milliseconds per call. When many requests arrive concurrently, cache misses (cold expert activations) accumulate and create PCIe bottlenecks. For NVMe offloading and the three-tier storage hierarchy (GPU HBM, CPU DRAM, NVMe SSD), see the NVMe KV cache offloading guide, which covers the same principles applied to KV blocks instead of expert weights.
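A back-of-envelope estimate shows where the milliseconds come from. This sketch assumes the ~63 GB/s PCIe 5.0 figure above as an ideal, fully achieved bandwidth and uses this guide's ~1.56B parameters-per-expert figure (aggregated across layers) as the transfer size; a per-layer slice would be proportionally smaller, and real transfers add latency on top.

```python
def transfer_time_ms(params, bytes_per_param, bandwidth_gb_s=63.0):
    """Lower-bound time to copy a weight slice from host to device, in milliseconds."""
    size_gb = params * bytes_per_param / 1e9
    return size_gb / bandwidth_gb_s * 1000.0

int4_ms = transfer_time_ms(1.56e9, 0.5)    # INT4: 0.5 bytes per parameter
int2_ms = transfer_time_ms(1.56e9, 0.25)   # INT2: 0.25 bytes per parameter
print(f"INT4: {int4_ms:.1f} ms, INT2: {int2_ms:.1f} ms per 1.56B-parameter transfer")
```

Because concurrent cold-expert misses share the same link, these per-transfer times add up rather than overlap once the bus is saturated.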

Dynamic precision stays on-GPU. INT4 re-quantization runs on CUDA and avoids the PCIe transfer entirely for warm/hot tier experts. The tradeoff is accuracy: INT4 introduces quantization noise that accumulates across every MoE layer in the stack. Keeping hot experts at FP8 confines that noise to the cold tail, where routing frequency is low.

For production deployments that need maximum throughput on NVLink-connected GPUs, the DeepEP and DeepGEMM guide covers the kernel-level optimization layer, including custom GEMM kernels for grouped expert computation. Dynamic precision and custom GEMM are complementary: DynaExq reduces VRAM requirements on the single-GPU or dual-GPU path; DeepGEMM accelerates FP8 grouped GEMM throughput on multi-GPU NVLink clusters.

Benchmarks: Throughput and Accuracy on Qwen3-235B-A22B

The figures below are illustrative estimates based on published per-expert quantization research results (including work on non-uniform expert precision in MoE models). They are not measured results from this specific model configuration. Actual numbers vary by calibration quality, traffic distribution, hardware generation, and quantization implementation. Note: the DynaExq paper (arXiv 2511.15015) published benchmark results on Qwen3-MoE-30B and 80B models; the 235B figures here are extrapolated from those results and should be treated as directional estimates.

Configuration assumptions: Qwen3-235B-A22B, decode-only benchmark (after prefill completes), batch size 1-4, representative instruction-following prompts.

Configuration	GPUs	VRAM used	Tokens/sec (approx)	Perplexity delta	Cost/hr (spot)
FP8 uniform, expert-parallel	4x H100 SXM5	~270 GB	~760	baseline	$5.96/hr
INT4 all experts (GPTQ)	2x H100 SXM5	~135 GB	~700 (-8%)	+0.25 ppl	$2.98/hr
DynaExq (FP8 hot / INT4 cold)	2x H100 SXM5	~135 GB	~737 (-3%)	+0.10 ppl	$2.98/hr
DynaExq + CPU offload (INT2 cold)	1x H200 SXM5	~128 GB	~623 (-18%)	+0.15 ppl	$1.77/hr

These figures are illustrative estimates consistent with published results on per-expert mixed-precision quantization. Exact throughput depends on batch size, calibration quality, and implementation. Perplexity deltas are approximate and model-specific.

Key observations: DynaExq on 2x H100 delivers slightly higher throughput than uniform GPTQ INT4 (~737 vs ~700 tokens/sec) with a smaller perplexity penalty (+0.10 vs +0.25 ppl), because the hot expert population stays at FP8 rather than losing precision. The DynaExq + CPU offload configuration achieves a single-GPU deployment at the cost of an 18% throughput reduction versus the 4x H100 baseline.

Spot prices used: H100 SXM5 $1.49/hr, H200 SXM5 $1.77/hr per GPU, as of 13 Jun 2026.

Pricing fluctuates based on GPU availability. The prices above are based on 13 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

Setup Walkthrough: Single-GPU Node with DynaExq

This walkthrough uses ExLlamaV2 with EXL2 mixed-precision quantization, which is the practical path for per-expert hotness-based precision allocation as of mid-2026.

VRAM Budget Calculation

For Qwen3-235B-A22B:

Total experts per layer: 128, top-8 routing
Approximate parameters per expert: ~1.56B (computed from total params minus shared attention/MLP layers)
Hot fraction (top 30%): 38 experts at FP8 (1 byte/param) = 38 × 1.56B = ~59.3 GB
Cold fraction (bottom 70%): 90 experts at INT4 (0.5 bytes/param) = 90 × 1.56B × 0.5 = ~70.2 GB
Shared layers (attention, routing, embeddings) at FP8: approximately 35 GB
Total weights: ~164.5 GB
With 15% overhead: ~189 GB

This exceeds a single H200 SXM5's 141 GB. To fit on one H200, increase the cold expert INT4 ratio to cover 85% of experts, or use INT2 for the coldest 40%. That brings total weight storage to approximately 125-130 GB, leaving 11-16 GB for KV cache. Practical for short-context (8K or less) inference or with NVMe KV offloading.
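The budget arithmetic above can be reproduced with a short calculator. A minimal sketch that assumes this walkthrough's figures (128 experts, ~1.56B parameters per expert, ~35 GB of shared layers, 15% overhead); the function and its parameter names are illustrative.

```python
def mixed_precision_budget_gb(
    n_experts=128,
    params_per_expert_b=1.56,  # billions of parameters, figure used in this walkthrough
    hot_fraction=0.30,
    hot_bytes=1.0,             # FP8
    cold_bytes=0.5,            # INT4
    shared_gb=35.0,            # attention, routing, embeddings at FP8
    overhead=0.15,
):
    """Return a breakdown of weight storage for a hot/cold precision split."""
    n_hot = round(n_experts * hot_fraction)
    n_cold = n_experts - n_hot
    hot_gb = n_hot * params_per_expert_b * hot_bytes
    cold_gb = n_cold * params_per_expert_b * cold_bytes
    weights_gb = hot_gb + cold_gb + shared_gb
    return {
        "n_hot": n_hot,
        "n_cold": n_cold,
        "hot_gb": hot_gb,
        "cold_gb": cold_gb,
        "weights_gb": weights_gb,
        "total_gb": weights_gb * (1.0 + overhead),
    }

budget = mixed_precision_budget_gb()
print({k: round(v, 1) for k, v in budget.items()})
print("fits on one 141 GB H200:", budget["total_gb"] <= 141)
```

Re-running it with a different `hot_fraction`, `cold_bytes`, or `shared_gb` shows how far each lever moves the total before you commit to a quantization run.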

For Qwen3-80B-class MoE models with fewer total parameters, a single H200 is comfortable at DynaExq mixed precision with room for larger KV caches.

Provision the Instance

Log into app.spheron.ai and select one of:

H200 SXM5 141GB for Qwen3-80B-class models or Qwen3-235B at reduced context
2x H100 SXM5 (160 GB combined via NVLink) for Qwen3-235B at production context lengths

For step-by-step instance setup and SSH access, see the Spheron docs. For H100 instance options and current on-demand pricing starting from $2.54/hr, see H100 GPU rental on Spheron. For H200 availability starting from $4.84/hr on-demand, see H200 GPU rental on Spheron.


Install ExLlamaV2

ExLlamaV2 is now archived, with active development continuing as ExLlamaV3. The EXL2 format and tooling remain fully functional for mixed-precision inference.

bash

git clone https://github.com/turboderp-org/exllamav2
cd exllamav2
# Install PyTorch first: the exllamav2 build imports torch to compile its CUDA extension
pip install "torch>=2.2.0" transformers accelerate
pip install -e .

# Verify installation
python -c "import exllamav2; print('ExLlamaV2 ready')"

Prepare Calibration Data for EXL2 Conversion

EXL2's per-layer mixed precision is error-driven: the quantization algorithm measures how much each tensor's output changes under quantization and automatically assigns more bits to sensitive layers. When your calibration data activates hot experts frequently, those layers receive higher weight under the error budget, producing the hot/cold precision split without any manual importance file.

There is no separate measure.py script in ExLlamaV2. The measurement pass runs as the first phase of convert.py and saves measurement.json to the output directory. To iterate with different -b settings, reuse the cached measurement by passing -m /path/to/output/measurement.json on subsequent runs to skip the expensive calibration pass.

EXL2 calibration expects Parquet format, not JSONL. Convert your prompts first:

bash

# Convert calibration prompts from JSONL to Parquet
# (Qwen3-235B-A22B is large - expect ~470 GB for BF16 weights)
python -c "
import pandas as pd
df = pd.read_json('/path/to/calibration-prompts.jsonl', lines=True)
df.to_parquet('/path/to/calibration.parquet')
"

Use 500-1000 examples representative of your actual workload. Generic calibration data like Wikitext-2 activates a different expert distribution than production traffic, so use data from your real use case for best results. Unlike the llama.cpp/GGUF path described earlier, EXL2 needs no separate imatrix file: the calibration data feeds directly into convert.py.

Quantize to EXL2 Mixed Precision

bash

# Run from the exllamav2 repository root, where convert.py lives
# -i: input model directory
# -o: working directory (measurement.json and the quantized output are written here)
# -c: calibration dataset in Parquet format
# -b 4.65: average bits per weight (targets ~135 GB total)
# -hb 8: bit-width for the lm_head (output projection) layer
python convert.py \
  -i /models/qwen3-235b-a22b/ \
  -o /models/qwen3-235b-exl2-4.65bpw/ \
  -c /path/to/calibration.parquet \
  -b 4.65 \
  -hb 8

EXL2's per-layer bit allocation is entirely automatic: the algorithm measures quantization error for each tensor during the calibration pass and assigns higher bits to layers with greater sensitivity. With calibration data that activates hot experts frequently, those layers naturally receive more bit budget under the -b 4.65 average. The first run generates measurement.json in the output directory; for subsequent runs at different -b levels, add -m /models/qwen3-235b-exl2-4.65bpw/measurement.json to skip the calibration pass.

The -hb 8 flag sets precision for the lm_head output projection layer only, not for expert FFN layers. Adjust -b to control total VRAM usage: 4.0 bpw targets approximately 118 GB, 5.0 bpw targets approximately 146 GB.
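The size targets quoted above follow from parameters times average bits per weight. A minimal sketch, assuming all 235B parameters are quantized at the average rate; it ignores the separate lm_head bit-width, KV cache, and runtime overhead, and the helper names are illustrative.

```python
def exl2_weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB for an average bits-per-weight setting."""
    return params_billion * bits_per_weight / 8.0

def bpw_for_target(params_billion, target_gb):
    """Average bits per weight needed to land on a target weight size."""
    return target_gb * 8.0 / params_billion

for bpw in (4.0, 4.65, 5.0):
    print(f"-b {bpw}: ~{exl2_weight_gb(235, bpw):.1f} GB")
print(f"135 GB target: -b {bpw_for_target(235, 135):.2f}")
```

Working backwards with `bpw_for_target` gives a starting value for -b once you know how much VRAM is left after reserving KV cache.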

Launch the Inference Server

tabbyAPI is configured via config.yml, not CLI flags. Copy the sample config and edit it:

bash

# Using tabbyAPI (OpenAI-compatible server for ExLlamaV2)
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
pip install -r requirements.txt

# Copy sample config and edit it
cp config_sample.yml config.yml

Edit config.yml to set your model path and cache options:

yaml

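# config.yml (these options live under the model: section of config_sample.yml)
model:
  model_dir: /models                    # directory that contains model folders
  model_name: qwen3-235b-exl2-4.65bpw   # folder inside model_dir to load at startup
  cache_mode: Q4       # INT4 KV cache to free additional VRAM
  max_seq_len: 8192    # conservative for H200 single-GPU; reduce further if OOM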

Then launch the server:

bash

python main.py

Monitor VRAM usage after model load: nvidia-smi --query-gpu=memory.used --format=csv. If you are above 90% before any requests, reduce max_seq_len in config.yml or drop to a lower -b quantization level.
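The 90% check can be scripted. This sketch assumes nvidia-smi is on PATH and queries memory.total alongside memory.used with the csv,noheader,nounits format so the output is plain numbers in MiB; the helper names are illustrative.

```python
import subprocess

def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB, no header, no units) into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        rows.append((used, total, used / total))
    return rows

def query_gpus():
    """Run nvidia-smi and return one (used_mib, total_mib, fraction) per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)

def over_budget(rows, limit=0.90):
    """Indices of GPUs whose memory use exceeds the given fraction."""
    return [i for i, (_, _, frac) in enumerate(rows) if frac > limit]

# Example with captured output instead of a live GPU:
sample = "131072, 143771\n70000, 81559\n"
print(over_budget(parse_memory_csv(sample)))
```

On a GPU node, `over_budget(query_gpus())` run right after model load returns the devices that need a smaller max_seq_len or a lower -b level.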

Benchmark and Validate

bash

# Run accuracy check against FP8 baseline on MMLU
# (local-completions expects the full completions endpoint in base_url)
lm_eval --model local-completions \
  --model_args model=qwen3-235b-exl2,base_url=http://localhost:5000/v1/completions \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 4

# Throughput benchmark: 200 prompts, representative length
python benchmark_serving.py \
  --base-url http://localhost:5000/v1 \
  --model qwen3-235b-exl2 \
  --num-prompts 200 \
  --request-rate 2

Acceptable accuracy floor: MMLU score within 1.5 percentage points of the FP8 baseline. If degradation exceeds this, raise -b in 0.5 bpw increments so the expert layers receive a larger bit budget. The -hb flag is already at its maximum of 8 and only affects the lm_head, so it does not help here.

When This Beats Expert Parallelism (and When It Does Not)

Dynamic expert quantization solves a different problem than expert parallelism. The decision between them depends on GPU availability and your throughput requirements.

DynaExq wins when:

GPU budget is constrained to 1-2 GPUs and you cannot run multi-GPU expert-parallel
Routing imbalance is high: a hotness ratio of 3:1 or greater between the most-activated and least-activated experts makes the precision split highly effective
Latency SLO allows up to 20% decode throughput reduction compared to uniform FP8 at higher GPU count
You are running batch size 1-4 (low concurrency), where single-GPU throughput is sufficient and cost per token matters more than absolute peak throughput
The model has many total experts (32, 64, 128, or 256), making the cold expert population large and the VRAM savings substantial

Expert parallelism wins when:

You have 4 or more high-bandwidth NVLink GPUs available and want maximum throughput
Traffic is high-concurrency (16+ simultaneous requests) where multi-GPU batching amortizes routing overhead
Routing is relatively uniform: if all experts are equally likely, there is no hot/cold distinction and DynaExq provides no accuracy advantage over uniform INT4
SLO requires peak decode throughput with no degradation versus full FP8

Hybrid: the practical middle ground. Two GPUs with DynaExq mixed precision provide the best of both approaches for budget-constrained deployments. The DynaExq VRAM savings let you run Qwen3-235B on 2x H100 instead of 4x H100, halving hourly cost while losing only 3-5% throughput compared to uniform FP8 on 4x H100. This is the configuration most useful for production inference at $3-6/hr rather than $6-12/hr.

For MoE models where you want to combine expert parallelism across multiple GPUs with speculative decoding to further accelerate decode, see the speculative decoding for MoE models guide. The two techniques are not mutually exclusive: you can run DynaExq mixed-precision weights on a 2-GPU configuration and add a draft head for decode acceleration. At the extreme end of the MoE weight spectrum, NVFP4 on B200 SXM6 enables deploying V4-Pro on a single B200 node: at 4-bit hardware-accelerated precision, the 1.6T weight set drops from ~1,600 GB (FP8) to ~800 GB, fitting a single 8x B200 SXM6 node (1,536 GB VRAM) with room for KV cache.

Cost-Per-Token Math: One GPU + DynaExq vs Four-GPU Expert Parallel

The most important comparison for budget-sensitive inference is cost per million tokens, not peak throughput.

Formula: cost_per_1M_tokens = (hourly_rate / tokens_per_hour) × 1,000,000

Using live Spheron spot pricing as of 13 Jun 2026:

Configuration	Hourly cost (spot)	Approx tokens/sec	Tokens/hr	Cost/1M tokens
4x H100 SXM5, FP8 uniform	$5.96/hr	~760	~2,736,000	~$2.18
2x H100 SXM5, DynaExq FP8/INT4	$2.98/hr	~737	~2,653,200	~$1.12
1x H200 SXM5, DynaExq + offload	$1.77/hr	~623	~2,242,800	~$0.79

The single H200 with DynaExq delivers cost per token that is about 64% lower than the 4x H100 FP8 baseline (~$0.79 vs ~$2.18 per 1M tokens), while throughput drops only 18%. For workloads where you are running continuous inference jobs and per-token cost drives the decision, this gap justifies the throughput tradeoff.
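The formula can be applied directly to the spot-price table. A minimal sketch using this section's illustrative throughput and pricing figures; the configuration labels are shorthand and the function names are not from any library.

```python
def cost_per_million_tokens(hourly_rate, tokens_per_sec):
    """cost_per_1M_tokens = (hourly_rate / tokens_per_hour) * 1,000,000"""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

def monthly_cost(hourly_rate, tokens_per_sec, tokens_per_month):
    """Total spend for a monthly token volume at the given rate and throughput."""
    return cost_per_million_tokens(hourly_rate, tokens_per_sec) * tokens_per_month / 1_000_000

# (hourly spot cost in USD, approximate tokens/sec) from the table above
spot = {
    "4x H100 FP8 uniform": (5.96, 760),
    "2x H100 DynaExq": (2.98, 737),
    "1x H200 DynaExq + offload": (1.77, 623),
}
for name, (rate, tps) in spot.items():
    print(f"{name}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```

Swapping in your own measured tokens/sec and current hourly rates is the quickest way to see whether the ranking holds for your workload.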

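On-demand pricing (for production workloads with tighter latency SLOs): H100 SXM5 from $2.54/hr, H200 SXM5 from $4.84/hr. At these rates the ranking changes: the 2x H100 DynaExq configuration is cheaper per token than the single H200.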

Configuration	On-demand cost/hr	Cost/1M tokens (on-demand)
4x H100 SXM5, FP8 uniform	$10.16/hr	~$3.71
2x H100 SXM5, DynaExq FP8/INT4	$5.08/hr	~$1.91
1x H200 SXM5, DynaExq + offload	$4.84/hr	~$2.16

Throughput estimates are illustrative and consistent with published per-expert mixed-precision quantization research. Actual tokens/sec vary by model configuration, batch size, sequence length, and hardware. Measure on your specific workload before production deployment.


The cost advantage compounds at scale. A deployment running 10 billion tokens per month pays approximately $21,800 on a 4x H100 FP8 setup versus $7,900 on a single H200 with DynaExq: a $13,900/month difference for roughly equivalent output quality, at reduced peak throughput. If throughput is not the bottleneck, that is a straightforward optimization.

Running a frontier MoE on one or two GPUs instead of a four- or eight-card node is now practical with dynamic expert quantization. Spheron provides per-minute billing on bare-metal H100 and H200 instances, so you pay only for actual inference time.


Quick Setup Guide

Profile router hotness on your target model
Run 500-1000 representative prompts through the model and record which experts each router selects per layer. Sort experts by selection frequency across that sample. Classify the top 20-30% by frequency as hot, the rest as cold. Use this profile to confirm that your calibration set exercises the same hot experts, since EXL2 derives its bit allocation from calibration data rather than from a separate importance file.
Calculate VRAM budget with mixed-precision expert sizing
For Qwen3-235B-A22B with 128 experts and approximately 1.56B params each: hot experts at FP8 (1 byte/param) = 38 experts * 1.56B * 1 = ~59 GB. Cold experts at INT4 (0.5 bytes/param) = 90 experts * 1.56B * 0.5 = ~70 GB. Shared attention/MLP layers at FP8 = ~35 GB. Total before overhead = ~164 GB. With 15% overhead = ~189 GB. To fit on H200 (141 GB), push cold expert ratio higher or use INT2 for the lowest-frequency tail.
Provision a Spheron single-GPU or dual-GPU instance
Log into app.spheron.ai and select an H200 SXM5 141GB (single GPU, practical for Qwen3-80B-class models at mixed precision) or 2x H100 SXM5 (160 GB combined, suitable for Qwen3-235B at mixed precision with limited KV cache). SXM form factor instances include NVLink for dual-GPU configurations.
Install ExLlamaV2 and prepare calibration data
Clone and install ExLlamaV2: git clone https://github.com/turboderp-org/exllamav2 && pip install -e . There is no separate measurement script: the first convert.py run automatically measures per-tensor quantization error from your calibration data and writes measurement.json to the output directory. Reuse it on subsequent runs with -m measurement.json to skip the expensive calibration pass. Calibration data must be in Parquet format, not JSONL.
Quantize with EXL2 mixed-precision targeting expert hotness
From the exllamav2 repository root, run python convert.py -i /path/to/model -o /path/to/output -c /path/to/calibration.parquet -b 4.65 -hb 8. The -b 4.65 flag sets the average bits per weight; EXL2 automatically allocates more bits to tensors with higher quantization error, which naturally protects frequently-activated expert layers when your calibration data activates them. The -hb 8 flag sets precision for the lm_head output layer only. Adjust -b to control total VRAM: 4.0 bpw targets roughly 118 GB, 5.0 bpw roughly 146 GB.
Launch the inference server with the mixed-precision model
Start tabbyAPI (the ExLlamaV2 OpenAI-compatible server): copy config_sample.yml to config.yml and set model_dir, cache_mode: Q4, and max_seq_len in config.yml. Launch with python main.py. The Q4 cache mode keeps KV cache in INT4, freeing additional VRAM for context. Monitor GPU memory with nvidia-smi to confirm the model fits within budget.
Benchmark throughput and verify accuracy floor
Run lm-evaluation-harness on MMLU and ARC-C against your mixed-precision model and the FP8 baseline. Acceptable degradation is an MMLU score within 1.5 percentage points of the FP8 baseline, with throughput no more than 3% below uniform INT4. Measure tokens/sec with 200 representative prompts at your target batch size.


Frequently Asked Questions

What is dynamic expert quantization in MoE models?

Dynamic expert quantization (DynaExq) assigns different numerical precision to each expert based on how frequently the router selects it. Experts that get activated on most tokens (hot experts) stay in FP8 in VRAM. Experts that rarely activate (cold experts) are compressed to INT4 or INT2 and may be partially offloaded to CPU DRAM. This cuts effective VRAM usage by 30-50% on large MoE models without retraining, compared to uniform precision quantization where every expert uses the same bit-width.

What is expert hotness and how is it measured?

Expert hotness is a per-expert activation frequency score derived from router trace data: the distribution of which experts the top-K router selects across a representative sample of requests. In production, a sliding window of the last N forward passes is enough to build a hotness histogram. Experts in the top 20-30% by selection frequency are classified as hot; the rest are cold. The classification updates asynchronously so the CUDA stream is not blocked during precision transitions.

How does dynamic expert quantization differ from static INT4/GPTQ quantization?

Static quantization compresses all experts uniformly before serving and never changes precision at runtime. Dynamic expert quantization re-quantizes individual experts on the fly based on observed routing patterns. Hot experts are promoted back to FP8 when traffic patterns shift, and cold experts are demoted. The accuracy advantage is significant on models with heavy routing imbalance, like DeepSeek-style models where 256 experts exist but top-8 routing concentrates load on a recurring subset.

Can I run Qwen3-235B with dynamic expert quantization on a single H200?

Qwen3-235B-A22B has 235B total parameters. With the walkthrough's 70/30 cold/hot split (INT4 at 0.5 bytes/param for cold experts, FP8 at 1 byte/param for hot experts), weights come to roughly 164.5 GB before overhead, which exceeds the H200's 141 GB. Fitting on a single H200 takes a more aggressive split, with more experts at INT4 or INT2 for the coldest tail, to bring weight storage down to roughly 125-135 GB and leave about 6-16 GB for KV cache. That is marginal for real workloads. A single B200 SXM6 at 192 GB is the comfortable single-GPU target. An H200 works at reduced context (8K or less) or with NVMe KV offloading for anything longer.

When does dynamic expert quantization NOT make sense?

Skip dynamic precision when: (1) the model has few total experts (8 or fewer), because the overhead of per-expert precision tracking outweighs savings; (2) routing is fully uniform so every expert is equally likely, which removes the hot/cold distinction; (3) you are already running INT4 uniformly and cannot tolerate any accuracy regression from demoting cold experts even further.