Deploy Qwen3.7 Max on GPU Cloud: MoE Setup Guide (2026)

Qwen3.7 Max is Alibaba's frontier MoE model announced in May 2026, following Qwen 3.6 Plus by roughly six weeks. The architecture continues the MoE lineage with updated expert routing, reported improvements across standard benchmarks, and the same 1M token context window established in the prior generation.

For GPU sizing and deployment steps covering the previous generation, see the Qwen 3.6 Plus deployment guide, which covers the hybrid linear attention architecture and always-on chain-of-thought. For the largest competing open-weight MoE model, see the DeepSeek V4 deployment guide for expert parallelism configuration and hardware comparisons. For a direct benchmark comparison across the current frontier model set, see the open-weight frontier model showdown.

Exact architecture details, parameter counts, and benchmark figures for Qwen3.7 Max are not fully confirmed at time of writing. This guide flags speculative or third-party-sourced claims explicitly. Verify all figures against the official Alibaba blog post and Hugging Face model card before making hardware or cost decisions.

What's New in Qwen3.7 Max

Qwen3.7 Max introduces changes in three areas compared to Qwen 3.6 Plus: MoE routing, attention mechanism, and benchmark performance. Here is what is reported vs. what remains unconfirmed.

MoE Routing Updates

Early third-party analyses describe Qwen3.7 Max as using a larger expert pool than Qwen 3.6 Plus, with a refined top-K routing strategy that improves load balancing across experts. The specific changes have not been published in a technical report from Alibaba as of June 2026. Treat all figures on expert count and routing parameters as estimates until official documentation is available.

The practical implication for GPU planning is straightforward: a larger expert pool increases total weight storage without changing the per-token compute (which depends on active parameters, not total). If the flagship model has more total parameters than Qwen 3.6 Plus, it requires more VRAM for weight storage, even though inference speed is roughly proportional to active parameters.
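To make the storage-versus-compute distinction concrete, here is a minimal Python sketch. The layer count, expert size, and top-K value are hypothetical placeholders chosen for illustration; they are not published Qwen3.7 Max figures.

```python
def moe_param_counts(layers, experts_per_layer, top_k, params_per_expert, dense_params):
    """Return (total, active) parameter counts for a simple MoE stack.

    dense_params covers everything that runs for every token
    (attention, embeddings, router). All inputs are illustrative.
    """
    total = dense_params + layers * experts_per_layer * params_per_expert
    active = dense_params + layers * top_k * params_per_expert
    return total, active


# Hypothetical shapes, NOT published Qwen3.7 Max values.
small_pool = moe_param_counts(layers=60, experts_per_layer=128, top_k=8,
                              params_per_expert=50e6, dense_params=12e9)
large_pool = moe_param_counts(layers=60, experts_per_layer=256, top_k=8,
                              params_per_expert=50e6, dense_params=12e9)

for name, (total, active) in (("128 experts", small_pool), ("256 experts", large_pool)):
    print(f"{name}: total {total / 1e9:.0f}B params, active {active / 1e9:.0f}B params")
```

Doubling the expert pool roughly doubles the weights that must sit in VRAM, while the active count, and with it the per-token compute, stays the same.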

Attention Mechanism

Qwen3.7 Max reportedly builds on the hybrid attention architecture introduced in Qwen 3.6 Plus, with modifications to how attention is mixed with the MoE layers. Alibaba has not published a detailed architecture description as of June 2026. The 1M token context window is carried over from Qwen 3.6 Plus, suggesting the linear attention components remain structurally similar.

For planning purposes: assume the same KV cache memory requirements as Qwen 3.6 Plus for equivalent context lengths, and add overhead for the larger weight set if the total parameter count is higher.

Reported Benchmark Improvements

The table below shows reported benchmark deltas relative to Qwen 3.6 Plus and Qwen 3.5. All Qwen3.7 Max figures are from early third-party analyses and have not been verified against the official Alibaba technical report. Treat all Qwen3.7 Max figures as provisional.

| Benchmark | Qwen 3.5 | Qwen 3.6 Plus | Qwen3.7 Max (est.) | Notes |
|---|---|---|---|---|
| MMLU | ~87 | ~89 | ~91 (est.) | Qwen3.7 Max figure unconfirmed |
| HumanEval | ~88 | ~90 | ~93 (est.) | Qwen3.7 Max figure unconfirmed |
| MATH | ~90 | ~91 | ~93 (est.) | Qwen3.7 Max figure unconfirmed |
| GPQA | ~75 | ~77 | ~80 (est.) | Qwen3.7 Max figure unconfirmed |
| SWE-bench Verified | ~60 | ~63 | ~67 (est.) | Qwen3.7 Max figure unconfirmed |

Qwen 3.5 and Qwen 3.6 Plus figures are from published model cards where available, with a few from third-party analyses flagged in the source posts. Verify all figures against official publications before using them in benchmark comparisons.

Context Window

Qwen3.7 Max maintains the 1M token context window from Qwen 3.6 Plus. The KV cache memory requirements for 1M context are the same as documented in the Qwen 3.6 Plus guide: roughly 262-524 GB of additional VRAM per sequence, depending on how many standard vs. linear attention layers remain in the hybrid architecture. This makes 1M context only practical on 8x H200 (1,128 GB total) or similar large configurations.

For most production workloads, limit context to 32K-128K tokens and provision hardware accordingly. The 1M context capability exists but comes with real VRAM costs.
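The KV cache range above follows from a simple per-token formula: two tensors (K and V) per full-attention layer, each of size KV heads × head dimension × bytes per element. The sketch below applies it with assumed layer shapes. The head count, head dimension, and layer counts are hypothetical, picked only because they reproduce the 262-524 GB range; the real values depend on the unpublished architecture.

```python
def kv_cache_bytes(tokens, full_attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence, counting only full-attention layers.

    Linear-attention layers keep a fixed-size state instead of a per-token
    cache, so they are left out of this estimate.
    """
    return 2 * full_attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical shapes, NOT published Qwen3.7 Max values.
KV_HEADS, HEAD_DIM = 8, 128

for layers in (64, 128):
    for tokens in (32_768, 131_072, 1_000_000):
        gb = kv_cache_bytes(tokens, layers, KV_HEADS, HEAD_DIM) / 1e9
        print(f"{layers:3d} full-attention layers, {tokens:>9,} tokens: {gb:6.1f} GB")
```

Under these assumptions a 1M-token sequence needs about 262 GB with 64 full-attention layers and about 524 GB with 128, while 32K tokens needs under 20 GB in either case, which is why capping context is the cheapest way to recover VRAM.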

GPU Hardware Requirements

VRAM Requirements Table

Qwen3.7 Max total parameter counts are unconfirmed. The table below uses estimates based on the MoE scaling patterns from Qwen 3.6 Plus and comparable frontier models. All figures marked "(est.)" are speculative until Alibaba publishes official model card data.

| Variant | Total Params (est.) | Active Params (est.) | FP16 Size (est.) | FP8 Size (est.) | Min GPU Config | Recommended Config |
|---|---|---|---|---|---|---|
| Qwen3.7 Max MoE-mid | ~200-350B (est.) | ~25-30B (est.) | ~400-700 GB (est.) | ~200-350 GB (est.) | 2x H200 or 4x H100 | 4x H200 FP8 |
| Qwen3.7 Max MoE-flagship | ~700B-1T (est.) | ~37-42B (est.) | ~1.4-2 TB (est.) | ~700 GB-1 TB (est.) | 4x H200 or 8x H100 | 8x H100 FP8 or 4x H200 FP8 |

Notes on these estimates:

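  • FP16 size = total params × 2 bytes. FP8 size = total params × 1 byte. The table figures are weights only; budget an additional 10-15% for activations, framework memory, and KV cache at 32K context.
  • Active parameter count does not reduce VRAM requirements. All weights must be loaded regardless of how many activate per token.
  • These estimates assume scaling patterns consistent with DeepSeek V4 and Qwen 3.6 Plus. Actual sizes will differ when Alibaba publishes the model card.
  • The flagship row's minimum configurations assume the ~500 GB FP8 figure used in the deployment sections of this guide, which is below the 700 GB-1 TB range in the same row. If the released model lands in that range, 4x H200 (564 GB) and 8x H100 (640 GB) are not enough at FP8, and the entry point becomes 8x H200.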
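The sizing arithmetic in these notes can be sketched in a few lines of Python. The parameter counts are the guide's own estimates, the 0.9 factor mirrors the --gpu-memory-utilization value used in the serve commands, and the function names are illustrative rather than part of any library.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(total_params_billion, precision):
    """Weight footprint in GB: billions of params times bytes per param."""
    return total_params_billion * BYTES_PER_PARAM[precision]


def kv_headroom_gb(weight_gb, gpu_count, vram_per_gpu_gb, utilization=0.9):
    """VRAM left for KV cache and activations after loading the weights."""
    return gpu_count * vram_per_gpu_gb * utilization - weight_gb


# GPU count and VRAM per GPU (GB) for the configurations in this guide.
CONFIGS = {"2x H200": (2, 141), "4x H200": (4, 141), "8x H100": (8, 80), "8x H200": (8, 141)}

for params_b in (200, 500, 700):  # estimated total parameters, in billions
    need = weights_gb(params_b, "fp8")
    for name, (count, vram) in CONFIGS.items():
        room = kv_headroom_gb(need, count, vram)
        verdict = f"{room:6.1f} GB left" if room > 0 else "does not fit"
        print(f"{params_b}B at FP8 ({need:.0f} GB) on {name}: {verdict}")
```

Running this against any updated parameter count from the official model card is a quick way to re-check the table before provisioning.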

GPU Configurations with Live Pricing

The following configurations use pricing from the Spheron GPU pricing API fetched on 2026-06-08. All prices are per GPU per hour for the minimum available offer; actual bundle pricing may vary.

#### Single H200 SXM5 (141 GB VRAM): Mid-Variant at FP8

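If Alibaba releases a smaller MoE-mid sub-variant, a single H200 SXM5 provides 141 GB of VRAM, of which about 127 GB is usable at --gpu-memory-utilization 0.9. That is enough for a ~100-120 GB FP8 model, with roughly 7-27 GB remaining for KV cache at standard context lengths (32K-64K). The ~200-350 GB mid-variant estimated in the table above does not fit on one H200.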

Current rates on Spheron:

  • On-demand: $5.92/hr
  • Spot: $3.31/hr

At spot pricing, the single-H200 configuration is ~$2.61/hr cheaper than on-demand, which adds up to roughly $62.64 over a 24-hour run.

For H200 GPU rental on Spheron, see current availability and spot offers on the GPU rental page.

#### 2x H200 SXM5 (282 GB VRAM): Mid-Variant with Extended Context

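Two H200s give 282 GB of aggregate VRAM, or about 254 GB usable at --gpu-memory-utilization 0.9. At FP8 for a ~200B total parameter mid-variant (estimated ~200 GB weights), this leaves roughly 54 GB for KV cache, enough for 128K context and possibly up to 256K, depending on how many full-attention layers the architecture keeps.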

Estimated cluster rates:

  • On-demand: ~$11.84/hr (2 × $5.92)
  • Spot: ~$6.62/hr (2 × $3.31)

Use --tensor-parallel-size 2 in vLLM for this configuration.

#### 4x H200 SXM5 (564 GB VRAM): Flagship FP8 Entry Point

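Four H200 GPUs give 564 GB of aggregate VRAM, which fits an estimated ~500 GB FP8 flagship model with roughly 64 GB of raw headroom. Under --gpu-memory-utilization 0.9 the usable budget is about 508 GB, which leaves only ~8 GB for KV cache, so expect to raise the utilization cap or shorten --max-model-len to serve 32K-64K context. This is the minimum configuration for running the flagship MoE variant at FP8.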

Estimated cluster rates:

  • On-demand: ~$23.68/hr (4 × $5.92)
  • Spot: ~$13.24/hr (4 × $3.31)

Use --tensor-parallel-size 4 and --enable-expert-parallel for this configuration.

#### 8x H100 SXM5 (640 GB VRAM): Alternative Flagship Config

Eight H100 SXM5 GPUs (80 GB each) give 640 GB of aggregate VRAM with NVLink-connected inter-GPU bandwidth. This is the standard 8-GPU alternative to 4x H200 for the flagship FP8 configuration: slightly more aggregate VRAM, but lower per-GPU memory capacity and memory bandwidth than H200. Per-GPU compute throughput is essentially the same, since both are Hopper parts.

Estimated cluster rates:

  • On-demand: ~$40.08/hr (8 × $5.01)
  • Spot: ~$11.92/hr (8 × $1.49)

Use --tensor-parallel-size 8 and --enable-expert-parallel.

#### B200 Configurations

Each B200 SXM6 GPU has 192 GB of HBM3e memory, which fits a smaller Qwen3.7 Max variant at FP8 or BF16. The mid-variant at FP8 (~200 GB estimated) does not fit on a single B200; two B200 GPUs give 384 GB, which leaves comfortable headroom.

The B200's higher memory bandwidth and FP8 Tensor Core throughput make it cost-efficient for inference: higher tokens/second per dollar compared to H100 or H200 on most generation workloads.

Current rates for B200 instances on Spheron:

  • On-demand: $9.30/hr per GPU
  • Spot: $2.74/hr per GPU

Check B200 availability before provisioning; supply varies by region and time.

#### 8x H200 SXM5 (1,128 GB VRAM): Full-Precision or 1M Context

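Eight H200 GPUs give 1,128 GB of aggregate VRAM. That fits the flagship model in BF16 only if the weights come in near 1 TB (roughly 500B total parameters, the low end assumed in this guide). At FP8, it provides substantial KV cache headroom for 1M token context workloads.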

Estimated cluster rates:

  • On-demand: ~$47.36/hr (8 × $5.92)
  • Spot: ~$26.48/hr (8 × $3.31)

Only provision the 8x H200 configuration if your application genuinely requires BF16 precision or 1M token context. For most production inference, 4x H200 at FP8 delivers competitive quality at half the cost.

Multi-GPU Sharding Patterns

Tensor parallelism (TP): Splits every attention layer across multiple GPUs. All GPUs participate in every forward pass, which minimizes time-to-first-token (TTFT). Use TP when latency is the primary concern. Set --tensor-parallel-size equal to your GPU count.

Expert parallelism (EP): Splits MoE expert layers across GPUs, so each GPU holds a subset of experts. The router dispatches tokens to whichever GPU holds the relevant expert. Combined with TP for attention layers, EP can significantly improve throughput on multi-GPU MoE configurations. Enable with --enable-expert-parallel. The EP size is computed automatically from your TP and data parallelism settings. For a deeper look at the scheduling mechanics behind these throughput gains, see the continuous batching and PagedAttention guide.

Pipeline parallelism (PP): Splits model layers sequentially across GPUs. Useful when the model is too large for full TP across available GPUs. For Qwen3.7 Max on 8x H100, pure TP is simpler and preferred unless you have VRAM constraints that require PP.

Recommended sharding by config:

| Config | TP Size | Expert Parallel | PP |
|---|---|---|---|
| 2x H200 (mid-variant) | 2 | Yes (MoE layers) | No |
| 4x H200 (flagship FP8) | 4 | Yes | No |
| 8x H100 SXM5 (flagship FP8) | 8 | Yes | No |
| 8x H200 (BF16 or 1M context) | 8 | Yes | No |
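If you script your launches, the table above maps naturally onto a small lookup. The sketch below is one way to turn a row into a vllm serve command line. The dictionary keys and the function name are illustrative; the flags are the ones used elsewhere in this guide.

```python
import shlex

# One entry per row of the sharding table above.
SHARDING = {
    "2x H200": {"tp": 2, "expert_parallel": True},
    "4x H200": {"tp": 4, "expert_parallel": True},
    "8x H100": {"tp": 8, "expert_parallel": True},
    "8x H200": {"tp": 8, "expert_parallel": True},
}


def vllm_serve_command(model_path, config, max_model_len=32768, port=8000):
    """Build a shell-safe `vllm serve` command string for a table row."""
    plan = SHARDING[config]
    args = ["vllm", "serve", model_path,
            "--tensor-parallel-size", str(plan["tp"])]
    if plan["expert_parallel"]:
        args.append("--enable-expert-parallel")
    args += ["--gpu-memory-utilization", "0.9",
             "--max-model-len", str(max_model_len),
             "--port", str(port)]
    return shlex.join(args)


print(vllm_serve_command("/data/models/qwen37max-flagship", "4x H200"))
```

Keeping the sharding plan in data rather than in hand-edited shell commands makes it harder for the TP size and the actual GPU count to drift apart.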

What Won't Work

  • Single H100 80 GB for the flagship MoE: Even at aggressive INT4 quantization, a 700B-1T total parameter model requires far more than 80 GB. You cannot fit the weights.
  • 4x H100 at FP8 for flagship MoE: 320 GB aggregate VRAM is insufficient for the estimated ~500-700 GB FP8 flagship weight set. Only viable with community INT4 quantization checkpoints, if available.
  • Any A100 at FP8: A100 lacks hardware FP8 Tensor Core support. Using --quantization fp8 on A100 will either error or silently run at lower precision. Use INT8 (--quantization bitsandbytes) on A100 instead.
  • H100 or H200 for 1M context with batch size > 1: At 1M token context, each sequence requires 262-524 GB of KV cache. Serving multiple concurrent sequences at 1M context is only practical on configurations with several TB of aggregate VRAM, which is outside the scope of standard GPU rental.
  • RTX 4090 or consumer GPUs: 24 GB VRAM. Completely insufficient for any Qwen3.7 Max variant.

Step-by-Step vLLM Deployment on Spheron

Prerequisites

Provision a GPU instance on Spheron matching your model size and context requirements. Follow the Spheron quick-start guide for provisioning steps. For vLLM-specific configuration, see the vLLM server guide.

SSH in and verify your GPU setup:

bash
nvidia-smi
# Verify GPU count, VRAM per GPU, driver version, and CUDA version
# Example expected output for 4x H200: 4 GPUs at 141 GB each, driver 550+

Install vLLM

bash
pip install vllm --upgrade
python -c "import vllm; print(vllm.__version__)"

Check the vLLM supported models list to confirm Qwen3.7 Max is listed before proceeding. If it is not listed yet, add --trust-remote-code to your serve commands as a fallback.

Download Model Weights

Important: As of 08 Jun 2026, Qwen3.7 Max open weights had not been confirmed as publicly released on https://huggingface.co/Qwen. The download commands below are based on expected naming conventions from prior Qwen releases and will not work until Alibaba publishes the model weights. Verify the actual repository names on Hugging Face before running any of these commands.

Alibaba naming conventions vary between releases: Qwen 3 uses Qwen/Qwen3-32B (no dot), Qwen 3.5 uses Qwen/Qwen3.5-27B, and Qwen3.7 Max is expected to follow a similar pattern. Confirm the exact naming at the Qwen Hugging Face organization page.

bash
# Mid-variant (verify exact repo name at https://huggingface.co/Qwen first)
huggingface-cli download Qwen/Qwen3.7Max-[VARIANT] \
    --local-dir /data/models/qwen37max

# FP8 quantized checkpoint (if released by Alibaba)
huggingface-cli download Qwen/Qwen3.7Max-[VARIANT]-FP8 \
    --local-dir /data/models/qwen37max-fp8

Use persistent storage to avoid re-downloading on instance restarts. The flagship variant at FP8 is estimated at 500-700 GB, so plan for at least 1 TB of persistent storage for the full weights plus intermediate files.

Launch Inference Server

Four representative configurations covering the main use cases:

bash
# Single H200 SXM5 -- smaller sub-variant at FP8, standard context
# Points at the pre-quantized FP8 checkpoint so the BF16 weights never have to fit in VRAM
vllm serve /data/models/qwen37max-fp8 \
    --served-model-name Qwen/Qwen3.7Max \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 2x H200 SXM5 -- mid-variant at FP8, extended context (128K)
vllm serve /data/models/qwen37max-fp8 \
    --served-model-name Qwen/Qwen3.7Max \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 131072 \
    --port 8000

# 4x H200 SXM5 -- flagship at FP8, expert parallelism enabled
vllm serve /data/models/qwen37max-flagship \
    --served-model-name Qwen/Qwen3.7Max-Flagship \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

# 8x H100 SXM5 -- flagship at FP8, full tensor parallel
vllm serve /data/models/qwen37max-flagship \
    --served-model-name Qwen/Qwen3.7Max-Flagship \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --port 8000

Notes on flags:

  • --quantization fp8 activates native FP8 Tensor Cores on H100, H200, and B200. On A100, use --quantization bitsandbytes for INT8 instead.
  • --enable-expert-parallel activates expert parallelism for MoE layers. This distributes expert weight tensors across GPUs and routes tokens to the GPU holding each expert. For large MoE models, enabling EP significantly improves throughput on multi-GPU configurations.
  • --gpu-memory-utilization 0.9 caps VRAM usage at 90%, leaving a buffer for runtime allocations. If you encounter OOM errors, lower this to 0.85.
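  • --max-model-len sets the maximum sequence length (prompt plus output) for a single request. vLLM sizes the KV cache pool from --gpu-memory-utilization rather than from this flag, but it refuses to start if one sequence of this length cannot fit in the pool. Lower values let more concurrent requests share the pool. Start at 32768 and increase only if your application requires it.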

Test the API

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3.7Max",
        "messages": [{"role": "user", "content": "Explain sparse MoE routing in three sentences."}],
        "max_tokens": 512
    }'

Python OpenAI client example:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3.7Max",
    messages=[{"role": "user", "content": "Write a binary search implementation in Python."}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Monitor Throughput

bash
# GPU utilization and memory in a separate terminal
nvidia-smi dmon -s pum -d 5

# vLLM Prometheus metrics endpoint
curl -s http://localhost:8000/metrics | grep -E "^vllm:"

Key metrics to watch during load testing:

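  • vllm:gpu_cache_usage_perc: KV cache saturation, reported as a fraction from 0 to 1. At 0.95 or above, you are at the throughput ceiling for your current KV cache budget.
  • vllm:e2e_request_latency_seconds_bucket: End-to-end latency distribution.
  • vllm:request_success_total: Completed requests, labeled by finished_reason (stop, length, abort). Under normal load, aborted requests should stay at zero. Metric names have changed between vLLM releases, so confirm them against your own /metrics output.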
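For scripted load tests, the /metrics output is easy to parse without extra dependencies. The sketch below reads the Prometheus text format into a dictionary and checks KV cache saturation. The sample payload is made up for illustration (real output has many more series), and the helper names are not part of vLLM.

```python
def parse_prometheus(text):
    """Parse Prometheus text-format lines into {metric_name: [(labels, value)]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        metrics.setdefault(name, []).append((labels.rstrip("}"), float(value)))
    return metrics


def kv_cache_saturated(metrics, threshold=0.95):
    """True if any series of vllm:gpu_cache_usage_perc is at or above threshold."""
    series = metrics.get("vllm:gpu_cache_usage_perc", [])
    return any(value >= threshold for _, value in series)


# Made-up sample in the shape of a vLLM /metrics response.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen3.7Max"} 0.97
vllm:num_requests_running{model_name="Qwen/Qwen3.7Max"} 12.0
vllm:num_requests_waiting{model_name="Qwen/Qwen3.7Max"} 3.0
"""

parsed = parse_prometheus(SAMPLE)
print("KV cache saturated:", kv_cache_saturated(parsed))
print("Requests waiting:", parsed["vllm:num_requests_waiting"][0][1])
```

In a real test you would replace SAMPLE with the body fetched from the /metrics endpoint and poll it at a fixed interval.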

Enable Prefix Caching

For workloads with repeated system prompts or shared context prefixes, vLLM's prefix caching can significantly reduce TTFT for cached requests:

bash
vllm serve /data/models/qwen37max \
    --quantization fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --port 8000

Prefix caching is particularly valuable for agentic workloads where the same tool definitions and system prompt appear in every request. For in-depth KV cache tuning, see the NVIDIA NVMe KV cache guide.

SGLang Deployment

SGLang is an alternative to vLLM with its own RadixAttention cache and expert-parallel scheduling optimizations. For some MoE workloads, particularly those with high cache hit rates or complex agentic call patterns, SGLang achieves better throughput than vLLM.

Install SGLang

bash
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "sglang[all]"
python -c "import sglang; print(sglang.__version__)"

For flash attention support (recommended for H100/H200/B200):

bash
pip install flash-attn --no-build-isolation

Launch SGLang Server

For an 8x H200 flagship configuration (tp=4 × dp=2 = 8 GPUs total):

bash
python -m sglang.launch_server \
    --model-path /data/models/qwen37max-flagship \
    --tp 4 \
    --dp 2 \
    --quantization fp8 \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code

Flag notes:

  • --tp 4: Tensor parallelism degree. Combined with --dp, total GPUs required = tp × dp (4 × 2 = 8 in this example). For a 4-GPU cluster, use --tp 4 --dp 1 instead.
  • --dp 2: Data parallelism. Runs two independent replicas of the TP=4 model, each serving its own share of incoming requests. Every replica needs a full copy of the weights across its four GPUs.
  • --quantization fp8: Activates FP8 on Hopper-architecture GPUs (H100, H200).
  • --trust-remote-code: Required if the installed SGLang version does not yet have native Qwen3.7 Max support.

For SGLang-specific MoE expert-parallel scheduling, check the SGLang changelog for flags related to expert dispatch optimization. SGLang's RadixAttention automatically handles prefix caching without the explicit --enable-prefix-caching flag.

Test the SGLang Endpoint

bash
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "What are the main advantages of sparse MoE over dense transformers?"}],
        "max_tokens": 512
    }'

SGLang uses "model": "default" when serving a single model. For multi-model setups, check the SGLang docs at https://docs.sglang.ai for the model naming convention.

SGLang vs vLLM for Qwen3.7 Max

Use SGLang when:

  • Your workload has a high cache hit rate (shared system prompts, repeated tool definitions in agentic workflows).
  • You want RadixAttention for automatic prefix matching without explicit caching configuration.
  • You need finer control over expert dispatch scheduling for MoE inference.

Use vLLM when:

  • You need the broadest framework compatibility and the largest community of deployment guides.
  • You are troubleshooting a new model that does not yet have native support in either framework.
  • You need the --enable-prefix-caching and --enable-chunked-prefill flags, which have extensive vLLM documentation.

For a benchmark comparison across both frameworks on similar MoE models, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Performance Benchmarks

The following figures are estimates based on architectural extrapolation from DeepSeek V4 benchmarks (published in the DeepSeek V4 deployment guide) and published vLLM MoE performance data. Qwen3.7 Max-specific benchmarks are not available at time of writing. Actual results will vary based on hardware configuration, batch size, request length, and quantization precision.

| Hardware Config | Throughput (tok/s, est.) | TTFT p50 (512-token prompt, est.) | ITL (est.) | On-Demand Cost | Cost/M tokens (est.) |
|---|---|---|---|---|---|
| 8x H100 SXM5 FP8 | ~1,700-1,900 | ~15-20s | ~15ms | ~$40.08/hr | ~$5.86-6.55/M |
| 4x H200 SXM5 FP8 | ~1,400-1,700 | ~18-25s | ~18ms | ~$23.68/hr | ~$3.87-4.70/M |
| 2x B200 SXM6 FP8 | ~2,400-3,200 | ~12-18s | ~10ms | ~$18.60/hr | ~$1.61-2.15/M |
| 8x H200 SXM5 BF16 | ~1,200-1,500 | ~20-30s | ~20ms | ~$47.36/hr | ~$8.77-10.96/M |

Notes on these estimates:

  • Throughput figures are total server throughput across all concurrent requests, not single-request latency.
  • Cost/M tokens is derived as: (hourly_rate / throughput_tok_s / 3600) × 1,000,000.
  • B200 throughput estimate uses MLPerf Inference v6.0 data showing ~17,500 tok/s for Llama 2 70B offline mode. For a larger MoE model, the per-GPU throughput advantage over H100/H200 remains but at lower absolute numbers.
  • TTFT estimates assume a single active request. Under concurrent load, TTFT increases with batch size.
  • All figures are estimates. Measure actual performance on your workload using the vLLM metrics endpoint or nvidia-smi dmon before making production capacity decisions.
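The cost-per-million-tokens formula in these notes is simple enough to re-run with your own measured throughput. A minimal sketch, using the table's estimated rates and throughput ranges as inputs (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_rate_usd, throughput_tok_s):
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = throughput_tok_s * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (hourly on-demand rate, low throughput, high throughput) from the table above.
ESTIMATES = {
    "8x H100 SXM5 FP8": (40.08, 1700, 1900),
    "4x H200 SXM5 FP8": (23.68, 1400, 1700),
    "2x B200 SXM6 FP8": (18.60, 2400, 3200),
    "8x H200 SXM5 BF16": (47.36, 1200, 1500),
}

for config, (rate, low_tps, high_tps) in ESTIMATES.items():
    best = cost_per_million_tokens(rate, high_tps)
    worst = cost_per_million_tokens(rate, low_tps)
    print(f"{config}: ${best:.2f}-{worst:.2f} per million tokens")
```

Because the cost scales inversely with throughput, a measured throughput that is 20% below the estimate raises the per-token cost by 25%, so substitute real numbers from your load test as soon as you have them.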

Cost/M Tokens: On-Demand vs Spot

Spot instances offer significant savings on Spheron, with the discount varying by GPU type. H100 SXM5 and B200 SXM6 spot rates are roughly 70% below on-demand; H200 SXM5 spot is approximately 44% below on-demand. On the 4x H200 configuration:

  • On-demand: ~$3.87-4.70/M tokens (estimated)
  • Spot: ~$2.16-2.63/M tokens (estimated, using $3.31/hr per H200 spot rate)

On the 8x H100 configuration, spot pricing at $1.49/hr per GPU ($11.92/hr total) cuts the on-demand rate by ~70%, reducing per-token cost from ~$5.86-6.55/M to roughly $1.74-1.95/M. For batch inference workloads that tolerate interruptions, spot is always worth enabling.

Qwen3.7 Max vs DeepSeek V4: Benchmarks and Cost

Both models are frontier-tier MoE deployments. The hardware and cost requirements are similar: both require 4x H200 or 8x H100 at FP8, and both carry similar hourly rates for equivalent configurations.

The differentiation is in task-specific quality and architecture:

| Dimension | Qwen3.7 Max | DeepSeek V4 |
|---|---|---|
| Total parameters (est.) | ~700B-1T (unconfirmed) | ~1T |
| Active parameters (est.) | ~37-42B (unconfirmed) | ~37B |
| Context window | 1M tokens | 1M tokens (via DSA) |
| MoE routing | Updated top-K (details unconfirmed) | Top-K via DeepSeek Sparse Attention |
| Function calling | OpenAI-compatible, parallel calls | OpenAI-compatible, parallel calls |
| MMLU (est./reported) | ~91 (unconfirmed) | ~88 (published) |
| HumanEval (est./reported) | ~93 (unconfirmed) | ~86 (published) |
| Hardware minimum | 4x H200 FP8 | 4x H200 FP8 |
| License | TBD | MIT |

The license situation for Qwen3.7 Max is not confirmed at time of writing. Earlier Qwen models have used Apache 2.0 or similar permissive licenses, but the exact terms for 3.7 Max were not published as of June 2026. DeepSeek V4 is MIT-licensed per published information.

If function-calling reliability is critical to your use case: early benchmarks on structured output accuracy favor Qwen3.7 Max based on limited third-party reports, but the margin is small and workload-dependent. Run both models on your actual tool-calling task before deciding. For a structured approach to measuring function-call accuracy, see the BFCL and tau-Bench tool calling benchmarks guide.

Function Calling, Structured Output, and Tool Use

Qwen3.7 Max supports the full OpenAI-compatible tool-use interface: function definitions, parallel tool calls, JSON mode, and strict schema enforcement.

Comparison: Qwen3.7 Max vs GPT-6 vs DeepSeek V4

| Feature | Qwen3.7 Max | GPT-6 | DeepSeek V4 |
|---|---|---|---|
| JSON mode | Yes | Yes | Yes |
| Strict schema enforcement | Yes | Yes | Yes |
| Parallel tool calls | Yes | Yes | Yes |
| Tool-use format | OpenAI-compatible | Native OpenAI | OpenAI-compatible |
| Streaming with tool calls | Yes (via vLLM) | Yes (API) | Yes (via vLLM) |
| Nested tool calls | Yes | Yes | Yes |
| Known failure modes | Occasional argument hallucination on complex schemas (third-party reports) | Rare; best-of-class JSON accuracy | Similar to Qwen3.7 Max |

Function Calling Example

json
{
  "model": "Qwen/Qwen3.7Max",
  "messages": [
    {"role": "user", "content": "What is the current H100 on-demand price on Spheron?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_gpu_pricing",
        "description": "Fetch current GPU rental pricing from Spheron",
        "parameters": {
          "type": "object",
          "properties": {
            "gpu_model": {
              "type": "string",
              "description": "GPU model name (e.g., 'H100 SXM5', 'H200 SXM5')"
            },
            "pricing_type": {
              "type": "string",
              "enum": ["on-demand", "spot", "reserved"]
            }
          },
          "required": ["gpu_model"]
        }
      }
    }
  ],
  "max_tokens": 512
}

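The model returns a tool_calls array in the response with the function name and arguments. For parallel tool calls, the model returns multiple entries in tool_calls in a single response, which you then execute concurrently before sending the results back. When serving with vLLM, tool calls are only parsed into tool_calls if the server was started with --enable-auto-tool-choice and a --tool-call-parser that matches the model's chat template; check the vLLM tool-calling documentation for the parser to use with Qwen3.7 Max.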
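The execute-concurrently step can be sketched as follows. The tool_calls list is a hand-written stand-in in the OpenAI response shape, and get_gpu_pricing is a stub that returns the rates quoted in this guide; a real implementation would call a pricing API.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def get_gpu_pricing(gpu_model, pricing_type="on-demand"):
    """Stub tool: looks up a rate from a fixed table instead of a live API."""
    table = {("H100 SXM5", "on-demand"): 5.01, ("H200 SXM5", "on-demand"): 5.92}
    return {"gpu_model": gpu_model, "pricing_type": pricing_type,
            "usd_per_hour": table.get((gpu_model, pricing_type))}


TOOLS = {"get_gpu_pricing": get_gpu_pricing}


def run_tool_call(call):
    """Execute one tool call and wrap the result as a `tool` role message."""
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    return {"role": "tool", "tool_call_id": call["id"],
            "content": json.dumps(fn(**args))}


def run_tool_calls(tool_calls):
    """Run all calls from one assistant turn concurrently, preserving order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_tool_call, tool_calls))


# Hand-written example of two parallel tool calls from a single response.
tool_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_gpu_pricing", "arguments": '{"gpu_model": "H100 SXM5"}'}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_gpu_pricing", "arguments": '{"gpu_model": "H200 SXM5"}'}},
]

tool_messages = run_tool_calls(tool_calls)
for message in tool_messages:
    print(message["tool_call_id"], message["content"])
```

Append the assistant message and these tool messages to the conversation, then send the next request so the model can compose its final answer. Since the source notes occasional argument hallucination on complex schemas, validating args against the schema before calling fn is a sensible addition in production.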

Structured Output with Pydantic

python
from openai import OpenAI
from pydantic import BaseModel

class GPUConfig(BaseModel):
    gpu_model: str
    count: int
    vram_per_gpu_gb: int
    recommended_quantization: str
    estimated_cost_per_hour: float

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.beta.chat.completions.parse(
    model="Qwen/Qwen3.7Max",
    messages=[{
        "role": "user",
        "content": "Recommend a GPU configuration for running a ~500GB FP8 MoE model on Spheron."
    }],
    response_format=GPUConfig
)
config = response.choices[0].message.parsed
print(f"Recommended: {config.count}x {config.gpu_model} at ${config.estimated_cost_per_hour}/hr")

For a deep dive into structured output benchmarks across models, see the structured output and function calling guide.

Production Checklist

Quantization Options

| Quantization | Precision | Hardware Requirement | Quality Impact |
|---|---|---|---|
| BF16 | 2 bytes/param | Any modern GPU (H100, H200, B200, A100) | Baseline (none) |
| FP8 | 1 byte/param | Hopper+ (H100, H200) or Blackwell (B200) | Minimal (<1% on most benchmarks) |
| INT8 | 1 byte/param | Any GPU including A100 | Small (~1-2% on most benchmarks) |
| INT4/AWQ | 0.5 bytes/param | Any GPU | Moderate (2-5% on complex tasks) |
| MXFP4 | 0.5 bytes/param | B200 (Blackwell FP4 Tensor Cores) | Moderate; hardware-native |

FP8 on H100/H200 is the recommended path: native hardware support, half the VRAM of BF16, and minimal quality loss. For teams running on A100, use INT8. For community checkpoints in AWQ or GPTQ format, verify quantization quality on your specific benchmark before moving to production.

For a comprehensive guide to quantization decision-making, see the AWQ quantization guide.

Autoscaling on Spheron

Spheron supports horizontal scaling across multiple GPU instances for load balancing. The basic pattern is:

  1. Deploy multiple vLLM server instances (one per GPU cluster node).
  2. Put a load balancer (nginx, HAProxy, or a managed LB) in front.
  3. Route requests round-robin or based on vllm:gpu_cache_usage_perc from each node's /metrics endpoint.
  4. Monitor latency and scale new nodes when p95 TTFT exceeds your SLA target.
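Step 3's metric-based routing can be sketched as a least-loaded selection. The node names and usage values below are hypothetical, and in practice the usage figures would come from polling each node's /metrics endpoint.

```python
def pick_node(cache_usage_by_node, saturation=0.95):
    """Return the node with the lowest KV cache usage, or None if all are saturated."""
    candidates = {node: usage for node, usage in cache_usage_by_node.items()
                  if usage < saturation}
    if not candidates:
        return None  # every node is at its ceiling: queue the request or scale out
    return min(candidates, key=candidates.get)


# Hypothetical snapshot of vllm:gpu_cache_usage_perc per node.
snapshot = {"node-a": 0.91, "node-b": 0.42, "node-c": 0.97}
print("Route to:", pick_node(snapshot))
```

A None result is a natural trigger for step 4: if every node is saturated, adding a node is likely to help more than re-balancing.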

For autoscaling configuration on Spheron, see the Spheron documentation for instance management APIs and horizontal scaling guides.

Observability

Three observability layers for production Qwen3.7 Max deployments:

GPU-level: nvidia-smi dmon -s pum -d 5 gives per-GPU utilization, power, and temperature at 5-second intervals. For continuous monitoring, DCGM (NVIDIA Data Center GPU Manager) provides the same data as Prometheus metrics.

Framework-level: vLLM exposes a /metrics endpoint in Prometheus format. Key metrics: vllm:e2e_request_latency_seconds, vllm:request_success_total, vllm:gpu_cache_usage_perc, vllm:num_requests_running. Set up a Prometheus + Grafana stack to visualize these over time.

Application-level: For agentic workloads with multi-turn conversations, track token usage per session, function call success rate, and response truncation rate (requests hitting max_tokens). High truncation rates usually mean max_tokens is too low for your workload. For production batch processing patterns that complement real-time serving, see the batch LLM inference guide.

For OpenTelemetry tracing integration with vLLM, see the vLLM documentation for the --otlp-traces-endpoint flag, which exports span data for each request.

Spheron GPU Pricing for Qwen3.7 Max Workloads

Live pricing from the Spheron GPU pricing API, fetched 2026-06-08. Multi-GPU cluster rates are calculated as: (per-GPU on-demand min) × (GPU count), which is an approximation since actual bundle offers may vary.

| GPU Config | On-Demand ($/hr) | Spot ($/hr) | Spot Savings | Monthly (24/7 on-demand) |
|---|---|---|---|---|
| 1x H200 SXM5 (mid-variant) | $5.92 | $3.31 | ~44% | ~$4,262 |
| 2x H200 SXM5 | ~$11.84 | ~$6.62 | ~44% | ~$8,525 |
| 4x H200 SXM5 (flagship entry) | ~$23.68 | ~$13.24 | ~44% | ~$17,050 |
| 8x H100 SXM5 (flagship alt.) | ~$40.08 | ~$11.92 | ~70% | ~$28,858 |
| 1x B200 SXM6 (smaller variant) | $9.30 | $2.74 | ~71% | ~$6,696 |
| 8x H200 SXM5 (BF16 or 1M context) | ~$47.36 | ~$26.48 | ~44% | ~$34,099 |

Pricing fluctuates based on GPU availability. The prices above are from 08 Jun 2026 and may have changed. Check the Spheron GPU pricing page for live rates.
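Since the rates will drift, it helps to be able to regenerate the derived columns from fresh per-GPU prices. The sketch below assumes a 30-day month, which is what the Monthly column above works out to; the function names are illustrative.

```python
def monthly_cost(hourly_rate_usd, hours_per_month=24 * 30):
    """24/7 cost for one month, assuming a 30-day month."""
    return hourly_rate_usd * hours_per_month


def spot_savings_pct(on_demand_usd, spot_usd):
    """Spot discount relative to on-demand, as a percentage."""
    return (1 - spot_usd / on_demand_usd) * 100


# Per-GPU (on-demand, spot) rates quoted in this guide, fetched 2026-06-08.
PER_GPU = {"H200 SXM5": (5.92, 3.31), "H100 SXM5": (5.01, 1.49), "B200 SXM6": (9.30, 2.74)}

for gpu, count in (("H200 SXM5", 4), ("H100 SXM5", 8), ("B200 SXM6", 1)):
    on_demand, spot = PER_GPU[gpu]
    print(f"{count}x {gpu}: ${on_demand * count:.2f}/hr on-demand, "
          f"{spot_savings_pct(on_demand, spot):.0f}% spot savings, "
          f"${monthly_cost(on_demand * count):,.0f}/month")
```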

Cost vs Alibaba Cloud API

Alibaba Cloud API pricing for Qwen3.7 Max is not confirmed at time of writing. Verify current rates at dashscope.aliyun.com if you are evaluating API vs self-hosted.

Self-hosting on Spheron becomes more cost-efficient than API pricing once daily token volume is high enough to amortize the fixed hourly rate. The break-even point depends on Alibaba's API rate per million tokens. To find it, multiply the 4x H200 on-demand rate (~$23.68/hr) by 24 to get the daily cost (~$568), then divide by the API price per million tokens. The result is the daily volume, in millions of tokens, at which the two options cost the same. Below that threshold, API pricing is cheaper and avoids infrastructure management overhead.

At sustained high-volume inference, the 4x H200 FP8 configuration at ~$23.68/hr provides predictable fixed-cost inference with full data privacy and no rate limits.
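The break-even reasoning above can be checked with a short calculation. The API price used here is a placeholder, not a real DashScope rate, and the throughput is the low end of this guide's 4x H200 estimate; substitute real values before drawing conclusions.

```python
def breakeven_mtok_per_day(hourly_rate_usd, api_usd_per_mtok):
    """Daily volume (millions of tokens) where self-hosting matches API cost."""
    return hourly_rate_usd * 24 / api_usd_per_mtok


def capacity_mtok_per_day(throughput_tok_s):
    """Maximum daily output (millions of tokens) at a sustained throughput."""
    return throughput_tok_s * 86_400 / 1_000_000


PLACEHOLDER_API_PRICE = 10.0  # USD per million tokens; NOT a published rate

breakeven = breakeven_mtok_per_day(23.68, PLACEHOLDER_API_PRICE)
capacity = capacity_mtok_per_day(1400)

print(f"Break-even: {breakeven:.1f}M tokens/day")
print(f"Cluster capacity: {capacity:.1f}M tokens/day")
print("Self-hosting can pay off" if breakeven < capacity
      else "Break-even exceeds what the cluster can serve")
```

The capacity check matters: if the break-even volume is higher than the cluster can physically produce in a day, self-hosting on that configuration never beats the API on cost alone.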

Troubleshooting

OOM at launch with expected VRAM budget: Reduce --max-model-len first. vLLM refuses to start when the KV cache pool left after loading weights cannot hold one sequence of --max-model-len tokens, so cutting from 131072 to 32768 lowers that minimum by 75%. If you still hit an out-of-memory error during loading or warm-up, lower --gpu-memory-utilization to 0.85.

Model class not found: Add --trust-remote-code to the serve command and verify you are on the latest vLLM release. Qwen3.7 Max requires native kernel support that may not be present in older releases.

Slow throughput on multi-GPU MoE: Enable --enable-expert-parallel if not already set. Without EP, vLLM shards each expert's weights across all tensor-parallel ranks, so every MoE layer pays for extra inter-GPU communication. With EP, whole experts live on individual GPUs and tokens are routed to them, which typically improves throughput on large MoE models.

Tensor parallel rank mismatch: Ensure --tensor-parallel-size evenly divides both your GPU count and the model's attention head count. Common valid values: 2, 4, 8. Passing --tensor-parallel-size 6 on an 8-GPU cluster will error.

Expert parallel auto-size errors: If you set both --tensor-parallel-size and --enable-expert-parallel, check the vLLM logs for the computed EP size. If the EP size exceeds the number of experts in the model, vLLM may error. Reduce TP size or check the model's expert count.
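The two parallelism checks above are easy to run before launching. This sketch assumes the EP size equals TP size × data-parallel size, which matches the description earlier in this guide; the head and expert counts are hypothetical, so read the real ones from the model's config.json.

```python
def check_parallel_config(gpu_count, tp_size, num_attention_heads, num_experts, dp_size=1):
    """Return a list of problems with a TP/EP layout; an empty list means it looks valid."""
    problems = []
    if gpu_count % tp_size != 0:
        problems.append(f"TP size {tp_size} does not divide GPU count {gpu_count}")
    if num_attention_heads % tp_size != 0:
        problems.append(f"TP size {tp_size} does not divide {num_attention_heads} attention heads")
    ep_size = tp_size * dp_size  # assumption: EP size derived from TP and DP
    if ep_size > num_experts:
        problems.append(f"EP size {ep_size} exceeds expert count {num_experts}")
    return problems


# Hypothetical model shape: 64 attention heads, 128 experts per layer.
for tp in (6, 8):
    issues = check_parallel_config(gpu_count=8, tp_size=tp,
                                   num_attention_heads=64, num_experts=128)
    print(f"TP={tp}:", issues or "OK")
```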

NCCL timeout on NVLink-less configurations: For PCIe-connected multi-GPU setups, inter-GPU communication is slower and NCCL timeouts may occur during initialization. Set NCCL_TIMEOUT=1800 and NCCL_DEBUG=INFO to diagnose. NVLink-connected H100 SXM5 or H200 SXM5 clusters avoid this issue.

INT4 quality degradation on complex tasks: Community INT4 checkpoints (AWQ or GPTQ) for Qwen3.7 Max may vary in calibration quality. If you observe significantly worse function-calling or structured output accuracy compared to FP8, switch to a different quantization checkpoint or move to INT8.

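A100 FP8 errors: A100 does not have hardware FP8 Tensor Core support. Switch to --quantization bitsandbytes for INT8 quantization on A100 hardware, and check the vLLM quantization documentation to confirm that the method supports MoE layers and tensor parallelism in your release.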


Qwen3.7 Max runs well on H200 and B200 hardware, and the cost gap versus H100 closes quickly at the throughput levels MoE models reach. Compare current on-demand and spot rates before provisioning.



Quick Setup Guide

  1. Choose GPU configuration for Qwen3.7 Max

    Qwen3.7 Max is a frontier MoE model announced in May 2026. Based on third-party analyses, the flagship variant is expected to require 4x H200 SXM5 (564GB total VRAM) or 8x H100 SXM5 (640GB total VRAM) at FP8 precision. A single H200 SXM5 (141GB) may be sufficient for smaller or more aggressively quantized sub-variants if Alibaba releases them. Identify your target context length before provisioning: 1M token context requires substantial additional VRAM for KV cache storage beyond the model weights, which pushes the minimum for full-context workloads to 8x H200 or larger. Do not plan hardware based on active parameter count alone, since all MoE weight tensors must reside in VRAM even though only a fraction activates per forward pass.

  2. Provision a GPU instance on Spheron

    Go to app.spheron.ai and select your GPU configuration based on your model size and context length target. For standard production inference at 32K-128K context, a single H200 SXM5 handles smaller MoE sub-variants and a 4x H100 SXM5 cluster fits the mid-range configurations at INT4 quantization. For the flagship FP8 configuration, provision a 4x H200 SXM5 or 8x H100 SXM5 cluster with NVLink interconnect where available. SSH in after provisioning and run nvidia-smi to verify GPU count, VRAM, and driver version before installing any software.

  3. Install vLLM or SGLang with Qwen3.7 Max support

    Install the latest vLLM release with 'pip install vllm --upgrade', then verify the version with 'python -c "import vllm; print(vllm.__version__)"'. Check the vLLM supported models list at https://docs.vllm.ai/en/latest/models/supported_models.html to confirm Qwen3.7 Max support status before starting a deployment. If the model class is not recognized, add --trust-remote-code as a temporary fallback while waiting for an official vLLM release that includes native Qwen3.7 Max support. For SGLang, install with 'pip install "sglang[all]"' (the quotes keep shells such as zsh from expanding the brackets) and verify the version similarly. Both frameworks release updates frequently when new frontier models land.

  4. Download Qwen3.7 Max weights from Hugging Face

    Check the Hugging Face Qwen organization page at https://huggingface.co/Qwen for available model repositories before running any download command. Alibaba naming conventions vary between model releases, and the exact repository name for Qwen3.7 Max has not been confirmed at time of writing. Once confirmed, use 'huggingface-cli download Qwen/[REPO-NAME] --local-dir /data/models/qwen37max' to download the weights. Store weights on a persistent volume so you do not re-download on instance restart. If Alibaba releases official FP8 quantized checkpoints alongside the base weights, they will likely follow a naming pattern such as Qwen3.7Max-[VARIANT]-FP8.

  5. Launch the inference server

    For vLLM on a single H200 SXM5 with FP8 quantization, use: 'vllm serve [MODEL_PATH] --quantization fp8 --gpu-memory-utilization 0.9 --max-model-len 32768 --port 8000'. For the flagship MoE on 8x H100 SXM5, add '--tensor-parallel-size 8 --enable-expert-parallel'. For SGLang on the flagship 8x H100 SXM5 configuration: 'python -m sglang.launch_server --model-path [MODEL_PATH] --tp 8 --dp 1 --quantization fp8 --host 0.0.0.0 --port 30000'. Adjust --max-model-len based on your context length target and available VRAM after model weights load. Run nvidia-smi immediately after the server starts to verify VRAM usage is within expected bounds.

  6. Validate the deployment with curl health checks

    Once the server is running, test the OpenAI-compatible endpoint with a simple curl request to http://localhost:8000/v1/chat/completions using the Content-Type: application/json header and a JSON body with model name, a simple user message, and max_tokens of 256. Inspect the response for a valid JSON completion in choices[0].message.content. If the response is malformed or missing, check server logs first. Monitor GPU utilization, power, and memory with 'nvidia-smi dmon -s pum -d 5' in a separate terminal. Check the vLLM metrics endpoint at http://localhost:8000/metrics for Prometheus-format data including TTFT, ITL, and KV cache usage percentage.
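Steps 5 and 6 can also be scripted. The sketch below assembles the vLLM launch command from the flags quoted in step 5 and builds the validation request from step 6 with the Python standard library instead of curl. The function names are illustrative, the model path is the example download directory from step 4, and the model name sent in the request assumes vLLM's default of serving the model under the path passed to vllm serve.

```python
import json
import shlex
import urllib.request
from typing import List


def build_vllm_command(model_path: str, tp_size: int = 1, expert_parallel: bool = False,
                       max_model_len: int = 32768, port: int = 8000) -> List[str]:
    """Assemble the 'vllm serve' argv using only the flags quoted in step 5."""
    cmd = ["vllm", "serve", model_path,
           "--quantization", "fp8",
           "--gpu-memory-utilization", "0.9",
           "--max-model-len", str(max_model_len),
           "--port", str(port)]
    if tp_size > 1:
        cmd += ["--tensor-parallel-size", str(tp_size)]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")
    return cmd


def build_chat_request(model_name: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) the step 6 validation request."""
    body = {
        "model": model_name,
        "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_content(response_body: bytes) -> str:
    """Pull choices[0].message.content out of a raw JSON response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def run_health_check(model_name: str) -> str:
    """Send the request; call this only once the server is up."""
    with urllib.request.urlopen(build_chat_request(model_name), timeout=60) as resp:
        return extract_content(resp.read())


MODEL_PATH = "/data/models/qwen37max"  # example path from step 4
print(shlex.join(build_vllm_command(MODEL_PATH, tp_size=8, expert_parallel=True)))
```

Run the printed command in one terminal, wait for the server to finish loading, then call run_health_check(MODEL_PATH) from another. A KeyError or JSON decode error from extract_content means the response was malformed, so the server logs are the next place to look.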


Frequently Asked Questions

Qwen3.7 Max total parameter count is unconfirmed at time of writing. Based on third-party analyses and comparisons to DeepSeek V4, the flagship variant is expected to require 4x H200 SXM5 (564GB at FP8) or 8x H100 SXM5 (640GB at FP8) as a practical minimum. A single H200 SXM5 (141GB) may handle smaller sub-variants if Alibaba releases them. Do not plan GPU config based on active parameter count alone: the full MoE weight set must reside in VRAM even though only a fraction activates per token.
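As a rough sizing aid, weight storage scales with total parameters times bytes per parameter. The sketch below checks whether a given total fits on a cluster. The 400B parameter count and the 20% headroom reserved for KV cache and activations are illustrative assumptions, not published figures.

```python
GPU_VRAM_GB = {"H100 SXM5": 80, "H200 SXM5": 141}
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(total_params_billion: float, precision: str) -> float:
    """Approximate weight storage in GB (1B params at 1 byte each is about 1 GB)."""
    return total_params_billion * BYTES_PER_PARAM[precision]


def fits(total_params_billion: float, precision: str, gpu: str, count: int,
         headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving headroom for KV cache and activations."""
    budget_gb = GPU_VRAM_GB[gpu] * count * (1 - headroom_fraction)
    return weight_gb(total_params_billion, precision) <= budget_gb


# Hypothetical total parameter count (assumption): 400B.
TOTAL_PARAMS_B = 400
for gpu, count in (("H200 SXM5", 1), ("H200 SXM5", 4), ("H100 SXM5", 8)):
    verdict = "fits" if fits(TOTAL_PARAMS_B, "fp8", gpu, count) else "does not fit"
    print(f"{count}x {gpu} at FP8: {verdict}")
```

Note that the function takes total parameters, not active parameters, which is the point of the warning above. Rerun it with the confirmed count once the model card is published.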

Qwen3.7 Max and DeepSeek V4 are both frontier MoE deployments requiring similar hardware (4x H200 or 8x H100 at FP8). Benchmark comparisons depend on task type: Qwen3.7 Max reportedly scores higher on function-calling accuracy and structured output tasks based on early third-party analyses, while DeepSeek V4 leads on specific competitive coding benchmarks per published data. Cost-wise, both run on the same hardware tier at equivalent hourly rates. The differentiation comes from throughput at your actual batch size and context length, so benchmark both on your workload before committing to either.

Qwen3.7 Max supports OpenAI-compatible function calling, JSON mode, and strict schema enforcement, continuing the tool-use format established in earlier Qwen generations. Parallel tool calls (multiple function calls in a single response) are supported. When serving with vLLM or SGLang, the model uses the standard /v1/chat/completions endpoint with the tools parameter, meaning existing client code requires no changes beyond updating the model name and base URL.
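A minimal sketch of the client side follows: it builds a request body with the tools parameter in the OpenAI chat-completions format and parses parallel tool calls out of a response. The get_weather tool and the sample response are invented for illustration. On the server side, vLLM generally needs tool-call parsing enabled at launch (--enable-auto-tool-choice plus a --tool-call-parser value); the correct parser for Qwen3.7 Max is not confirmed.

```python
import json
from typing import Any, Dict, List, Tuple


def build_tool_request(model_name: str, user_message: str) -> Dict[str, Any]:
    """Request body for /v1/chat/completions with one hypothetical tool."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }


def parse_tool_calls(response: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (function name, decoded arguments) for each tool call in the response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string and must be decoded.
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]


# Invented sample response showing two parallel tool calls.
SAMPLE_RESPONSE = {
    "choices": [{"message": {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}},
        {"id": "call_2", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}},
    ]}}]
}

for name, args in parse_tool_calls(SAMPLE_RESPONSE):
    print(name, args)
```

Because the body follows the OpenAI schema, the same dictionary can be passed through any OpenAI-compatible client after changing only the model name and base URL.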

FP8 is the recommended precision for H100, H200, and B200 hardware, all of which support native FP8 Tensor Cores. FP8 reduces model storage by roughly 50% versus BF16 with minimal quality loss on most benchmarks. A100 hardware lacks hardware FP8 support, so use INT8 via bitsandbytes on A100 instead. Community AWQ and GPTQ quantized checkpoints may appear on Hugging Face after the model weights are publicly released. Check the Qwen organization page for official quantization releases before using community versions, since unofficial checkpoints vary significantly in calibration quality.

Check the vLLM supported models list at https://docs.vllm.ai/en/latest/models/supported_models.html for current status. Qwen3.7 Max uses an updated MoE architecture building on Qwen 3.6 Plus; the required vLLM version depends on when Alibaba releases the weights and when vLLM adds native support. Always install the latest vLLM release when working with models announced after May 2026. If vLLM does not recognize the Qwen3.7 Max model class, add --trust-remote-code as a temporary fallback.
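The version check can be automated in a provisioning script. The sketch below reads the installed package version through the standard library and compares it against a minimum. The minimum version shown is a placeholder, since the first vLLM release with native Qwen3.7 Max support is not known.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(v: str) -> Tuple[int, ...]:
    """'0.9.1' -> (0, 9, 1). Stops at the first non-numeric component (dev, rc, post)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(package: str, minimum: str) -> bool:
    """True only if the package is installed at or above the minimum version."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)


# Placeholder minimum (assumption): replace once the supporting release is known.
HYPOTHETICAL_MIN_VLLM = "0.0.0"
print("vllm installed:", installed_version("vllm"))
print("meets minimum:", meets_minimum("vllm", HYPOTHETICAL_MIN_VLLM))
```

The comparison is deliberately simple and ignores pre-release ordering. For strict semantics, the packaging library's Version class is the more thorough option.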
