Deploy GLM-5.2 on GPU Cloud: Self-Host Z.ai's 744B Coding MoE with 1M Context (2026 Guide)

GLM-5.2 shipped on June 13, 2026. Z.ai announced the weights would be released under MIT license, with a staggered open-weight rollout scheduled for the days following launch. The model has a 744B MoE architecture that keeps the same ~40B active parameter count as GLM-5.1 but adds two things that matter for production: a 1M-token context window and a coding-first training focus that Z.ai claims beats GPT-5.5 on coding at roughly one-sixth the API cost. This guide covers the full deployment path: VRAM sizing for both standard and 1M-context workloads, vLLM and SGLang configuration, and cost math using live Spheron pricing.

If you are coming from the GLM-5.1 deployment guide, the hardware setup is nearly identical with one important difference: 1M-context workloads require FP8 KV cache and leave less headroom on 8x H200, so node sizing needs careful attention.

What's New in GLM-5.2 vs GLM-5.1

| Feature | GLM-5.1 | GLM-5.2 |
|---|---|---|
| Total parameters | 754B | 744B |
| Active parameters | ~40B | ~40B |
| Context window | 200K tokens | 1M tokens |
| Reasoning modes | None | High, Max |
| Training focus | Mixed coding + reasoning | Coding-first |
| License | MIT | MIT |
| Release date | April 7, 2026 | June 13, 2026 |

The parameter reduction from 754B to 744B is minor and does not materially change VRAM requirements. The meaningful upgrades are the 1M-token context and the dual reasoning modes.

High vs Max reasoning modes in practice: High mode produces responses with lighter chain-of-thought, which means shorter output sequences and lower end-to-end latency. Max mode activates deeper reasoning chains for harder agentic tasks. At inference time, this affects token budget, not GPU configuration. High mode is the right default for interactive coding agents; Max mode makes sense for batch agentic pipelines where latency is less critical than quality on hard problems. Reasoning mode names are per Z.ai's June 13, 2026 release notes; verify them against the official model card in case Z.ai has changed the naming since this guide was written.

GPU Memory Requirements

The same MoE misconception that applied to GLM-5.1 applies here: only ~40B parameters activate per forward pass, but all 744B must live in GPU memory. Expert routing cannot page weights in and out at inference time without prohibitive latency.

Weight memory by quantization:

  • BF16: 744B x 2 bytes = ~1,488 GB. Requires multi-node setup.
  • FP8: 744B x 1 byte = ~744 GB. Fits on 8x H200 SXM5 with headroom.
  • AWQ INT4: 744B x 0.5 bytes = ~372 GB, plus 5-10% for activations. Fits on 4x H200 or 5x A100 80GB.
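The weight figures above follow from parameters times bytes per parameter. A minimal sketch of that arithmetic, where `weight_gb` is a hypothetical helper name and the result covers weights only (no activations or KV cache):

```bash
#!/usr/bin/env bash
# weight_gb PARAMS_B BITS
# Weight footprint in GB for PARAMS_B billion parameters stored at BITS bits
# each (16 = BF16, 8 = FP8, 4 = INT4). One billion parameters at 8 bits is
# 1 GB, so GB = params_b * bits / 8.
weight_gb() {
  local params_b=$1 bits=$2
  echo $(( params_b * bits / 8 ))
}

for bits in 16 8 4; do
  echo "${bits}-bit: $(weight_gb 744 "$bits") GB"
done
```

For 744B parameters this prints 1488, 744, and 372 GB, matching the three bullets.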

For VRAM planning across other models, the GPU requirements cheat sheet for 2026 covers the memory math in a single lookup table.

1M-Context KV Cache Sizing

The 1M-token context window changes the VRAM picture significantly. KV cache memory grows linearly with sequence length:

kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For GLM-5.2 at 1M tokens (1,048,576 sequence length), approximate figures based on the model's MoE architecture:

| KV Precision | Seq Length | Batch Size | Approx KV Cache |
|---|---|---|---|
| FP16 | 131,072 (128K) | 4 | ~80 GB |
| FP16 | 1,048,576 (1M) | 1 | ~160 GB |
| FP8 | 1,048,576 (1M) | 1 | ~80 GB |
| FP8 | 1,048,576 (1M) | 4 | ~320 GB (requires OOM management) |
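The formula above can be evaluated directly. The sketch below is illustrative only: `kv_cache_gib` is a hypothetical helper, and the layer count, KV head count, and head dimension are assumed values picked so the output lines up with the approximate figures in the table. They are not published GLM-5.2 architecture numbers, so substitute the values from the model's config.json.

```bash
#!/usr/bin/env bash
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM SEQ_LEN BYTES_PER_ELEM [BATCH]
# Implements 2 x layers x kv_heads x head_dim x seq_len x bytes (x batch).
# The leading 2 accounts for storing both keys and values.
kv_cache_gib() {
  local layers=$1 kv_heads=$2 head_dim=$3 seq_len=$4 bytes=$5 batch=${6:-1}
  local total=$(( 2 * layers * kv_heads * head_dim * seq_len * bytes * batch ))
  echo $(( total / 1024 / 1024 / 1024 ))
}

# ASSUMED shape (not official): 80 layers, 4 KV heads, head_dim 128.
kv_cache_gib 80 4 128 1048576 2      # FP16, 1M tokens, batch 1
kv_cache_gib 80 4 128 1048576 1      # FP8,  1M tokens, batch 1
kv_cache_gib 80 4 128 131072  2 4    # FP16, 128K tokens, batch 4
```

With these assumed dimensions the three calls print 160, 80, and 80, showing the linear scaling in sequence length, batch size, and bytes per element.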

FP8 KV cache (--kv-cache-dtype fp8_e5m2) is mandatory for 1M-context workloads. With 744 GB of FP8 weights loaded on 8x H200 (1,128 GB total), roughly 384 GB of headroom remains. A single 1M-token sequence with FP16 KV consumes ~160 GB of that; FP8 KV halves it to ~80 GB. For 1M-context serving, also set --max-num-seqs 1 or --max-num-seqs 2 to avoid OOM at peak concurrency.
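The headroom shrinks further once vLLM's --gpu-memory-utilization cap is applied. A rough sketch of that budget, assuming the 0.92 setting used in the launch commands in this guide and the ~80 GB / ~160 GB per-sequence estimates from the table; `kv_headroom_gb` is a hypothetical helper, and activations and CUDA graph buffers are ignored:

```bash
#!/usr/bin/env bash
# kv_headroom_gb GPUS VRAM_PER_GPU_GB UTIL_PERCENT WEIGHTS_GB
# VRAM left for KV cache after the utilization cap and the model weights
# are accounted for (integer GB, rounded down).
kv_headroom_gb() {
  local gpus=$1 vram=$2 util=$3 weights=$4
  echo $(( gpus * vram * util / 100 - weights ))
}

headroom=$(kv_headroom_gb 8 141 92 744)
echo "8x H200 KV budget at 0.92 utilization: ${headroom} GB"
echo "1M-token sequences at FP8 KV (~80 GB each):   $(( headroom / 80 ))"
echo "1M-token sequences at FP16 KV (~160 GB each): $(( headroom / 160 ))"
```

On 8x H200 this gives a 293 GB budget: three 1M-token sequences at FP8 on paper, one at FP16. Because the sketch ignores runtime overhead, staying at one or two concurrent sequences is the safer setting.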

For a deep dive on KV cache memory math and optimization strategies, see the KV cache optimization guide.

GPU configurations that work for GLM-5.2:

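| GPU | VRAM per card | Count for FP8 | Count for AWQ INT4 |
|---|---|---|---|
| H200 SXM5 | 141 GB | 8x (1,128 GB) | 4x (564 GB) |
| B200 SXM6 | 192 GB | 8x (1,536 GB) | 4x (768 GB) |
| H100 SXM5 | 80 GB | 10x (800 GB) | 5x (400 GB) |
| A100 80GB SXM4 | 80 GB | 10x (800 GB) | 5x (400 GB) |

The 80 GB rows are minimum card counts by memory alone: 800 GB leaves only about 56 GB beyond the 744 GB of FP8 weights. The A100 also has no native FP8 tensor cores, so FP8 checkpoints there depend on software paths such as weight-only FP8 kernels rather than running at native FP8 speed.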

B200 is particularly well-suited for 1M-context workloads: its higher HBM3e bandwidth reduces time-to-first-token at long sequence prefill, and 8x B200 (1,536 GB) gives significantly more KV headroom than 8x H200 (1,128 GB). See available Spheron B200 instances for current spot pricing.

Spheron GPU Configurations and Pricing

Pricing fetched from the Spheron API on 17 Jun 2026:

| Configuration | Precision | Total VRAM | Spot Price | On-Demand Price |
|---|---|---|---|---|
| 8x H200 SXM5 | FP8 | 1,128 GB | $14.56/hr (8 x $1.82) | $38.72/hr (8 x $4.84) |
| 4x H200 SXM5 | AWQ INT4 | 564 GB | $7.28/hr (4 x $1.82) | $19.36/hr (4 x $4.84) |
| 8x B200 SXM6 | FP8, 1M context | 1,536 GB | $21.68/hr (8 x $2.71) | $59.28/hr (8 x $7.41) |

Pricing fluctuates based on GPU availability. The prices above are based on 17 Jun 2026 and may have changed. Check current GPU pricing for live rates.

The 8x H200 SXM5 FP8 spot configuration at $14.56/hr is the cost-optimal entry point for production. For 1M-context workloads that need the extra KV headroom, 8x B200 at $21.68/hr spot gives 408 GB more total VRAM and better prefill throughput at long sequences.

Step-by-Step Deployment with vLLM

Step 1: Provision the GPU Node on Spheron

Log in to app.spheron.ai and select an 8x H200 SXM5 or 8x B200 SXM6 instance. For standard workloads, spot is the right choice. For latency-sensitive production serving, use on-demand. SSH into the instance once it provisions (see the SSH connection guide).

Set storage to at least 800 GB NVMe for the FP8 weights.
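Before downloading, it can help to confirm the node actually has the disk space and GPUs you provisioned. A minimal preflight sketch: `free_gb` and `has_room` are hypothetical helper names, and the 800 GB threshold is the storage figure given above.

```bash
#!/usr/bin/env bash
# free_gb PATH: free space on the filesystem holding PATH, in whole GB.
# Uses POSIX df output (column 4 = available 1K blocks).
free_gb() {
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# has_room PATH MIN_GB: succeeds when at least MIN_GB is free.
has_room() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

if has_room . 800; then
  echo "OK: at least 800 GB free for the FP8 weights"
else
  echo "WARNING: only $(free_gb .) GB free; the FP8 download needs ~800 GB"
fi

# List GPUs and per-card memory when the NVIDIA driver is present.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
fi
```

Run it from the directory where the weights will be stored, since the check applies to that filesystem.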

Step 2: Install Dependencies

bash
# CUDA 12.4+ required
pip install "vllm>=0.19.0" huggingface_hub
export HF_TOKEN="<your_token>"  # replace with your Hugging Face access token

Step 3: Download Model Weights

bash
# FP8 variant (~800 GB on NVMe)
huggingface-cli download zai-org/GLM-5.2-FP8 \
  --local-dir ./glm5-2-fp8 \
  --repo-type model

The repo name follows Z.ai's naming convention from GLM-5.1 (zai-org/GLM-5.1-FP8). Verify that the repo exists on Hugging Face before running, as weights may still be uploading in the days following the June 13 launch. If the FP8 repo is not yet available, download the base zai-org/GLM-5.2 weights (~1,488 GB in BF16) and either let vLLM quantize them to FP8 at load time with --quantization fp8 or produce AWQ INT4 with AutoAWQ.
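One way to check whether a repo is visible before starting a multi-hundred-gigabyte download is to query the Hugging Face Hub model API. This is a sketch under assumptions: `repo_http_status` and `describe_status` are hypothetical helper names, and the mapping of status codes to messages is a simplification (the Hub can answer 401 for repos that are missing as well as for ones that are private or gated).

```bash
#!/usr/bin/env bash
# repo_http_status REPO_ID: HTTP status code returned by the Hub model API.
# Sends HF_TOKEN as a bearer token only when it is set.
repo_http_status() {
  local auth=()
  [ -n "${HF_TOKEN:-}" ] && auth=(-H "Authorization: Bearer ${HF_TOKEN}")
  curl -s --max-time 15 -o /dev/null -w '%{http_code}' \
    "${auth[@]}" "https://huggingface.co/api/models/$1"
}

# describe_status CODE: human-readable interpretation of the status code.
describe_status() {
  case "$1" in
    200)         echo "available" ;;
    401|403|404) echo "not visible (not yet published, private, or gated)" ;;
    *)           echo "unexpected status $1" ;;
  esac
}
```

Usage: `describe_status "$(repo_http_status zai-org/GLM-5.2-FP8)"`. If the result is anything other than "available", fall back to the base weights as described above.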

Step 4: Launch the vLLM Server

For standard workloads (up to 128K context):

bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-2-fp8 \
  --served-model-name glm5-2-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --port 8000

For 1M-context workloads (requires 8x H200 or 8x B200 with FP8 KV):

bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-2-fp8 \
  --served-model-name glm5-2-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 1048576 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 1 \
  --port 8000

Flag explanations:

| Flag | Purpose |
|---|---|
| --tensor-parallel-size 8 | Splits model shards across 8 GPUs |
| --enable-expert-parallel | Distributes MoE experts across GPUs for parallel routing |
| --kv-cache-dtype fp8_e5m2 | Halves KV cache memory; mandatory for 1M-context |
| --enable-chunked-prefill | Essential at 1M context to prevent single long prompts from blocking the batch |
| --max-num-seqs 1 | Required for 1M-context to avoid OOM at peak concurrency |

Note: in some vLLM v0.19.x minor versions the expert parallelism flag may be --moe-expert-parallel-size instead of --enable-expert-parallel. Check your installed version's release notes if the flag is not recognized.

See vLLM vs TensorRT-LLM vs SGLang benchmarks for a detailed throughput comparison across serving frameworks.

Step 5: Validate

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm5-2-fp8",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list."}],
    "max_tokens": 512
  }'

Benchmark throughput with vLLM's benchmark_serving.py before routing production traffic.
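Loading roughly 744 GB of weights takes a while, so the validation request will fail until the server has finished starting. A small polling sketch, assuming the /health endpoint exposed by vLLM's OpenAI-compatible server; `wait_for_server` is a hypothetical helper and the default timeout and interval are arbitrary choices:

```bash
#!/usr/bin/env bash
# wait_for_server BASE_URL [TIMEOUT_S] [INTERVAL_S]
# Polls BASE_URL/health until it answers with a success status or the
# timeout expires. Returns 0 when ready, 1 on timeout.
wait_for_server() {
  local base_url=$1 timeout=${2:-1800} interval=${3:-10}
  local deadline=$(( SECONDS + timeout ))
  while [ "$SECONDS" -lt "$deadline" ]; do
    if curl -sf --max-time 5 -o /dev/null "${base_url}/health"; then
      echo "ready"
      return 0
    fi
    sleep "$interval"
  done
  echo "timed out after ${timeout}s" >&2
  return 1
}
```

Usage: `wait_for_server http://localhost:8000 && echo "send the validation request"`.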

Serving GLM-5.2's 1M-Token Context Window

The 1M-token context window is the feature that separates GLM-5.2 from GLM-5.1 for agentic coding workloads. Serving it requires more than just increasing --max-model-len. Here is what actually needs tuning.

Chunked Prefill

At 1M tokens, a single prefill call processes one million tokens before generating a single output token. Without chunked prefill, that one prompt blocks every other request in the batch until its prefill completes. With chunked prefill enabled, vLLM processes the long prompt in chunks while interleaving other requests:

bash
--enable-chunked-prefill
--max-num-batched-tokens 32768

Set --max-num-batched-tokens, which caps the tokens processed per scheduler step and therefore the prefill chunk size, to a value your GPU can handle without latency spikes on shorter requests. 32,768 is a good starting point; increase it if throughput is the bottleneck, decrease it if short-request latency degrades.
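To get a feel for the trade-off, it helps to count how many scheduler steps a long prompt needs at a given chunk size. The sketch below is a simplification: `prefill_chunks` is a hypothetical helper, and it assumes the whole per-step token budget goes to the one long prompt, whereas in practice decode tokens from other requests share that budget.

```bash
#!/usr/bin/env bash
# prefill_chunks PROMPT_TOKENS CHUNK_TOKENS
# Scheduler steps needed to prefill one prompt when each step processes at
# most CHUNK_TOKENS tokens (ceiling division).
prefill_chunks() {
  local prompt=$1 chunk=$2
  echo $(( (prompt + chunk - 1) / chunk ))
}

prefill_chunks 1048576 32768   # full 1M-token prompt at the starting chunk size
prefill_chunks 1048576 8192    # smaller chunks: more steps, gentler on short requests
prefill_chunks 50000   32768   # a 50K-token codebase prompt
```

The three calls print 32, 128, and 2. Each step is a point where other requests can be scheduled, which is why smaller chunks help short-request latency at the cost of more steps for the long prompt.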

Prefix Caching

For agentic coding workflows where multiple agent steps share a long system prompt or a large codebase context, prefix caching reduces TTFT on subsequent calls:

bash
--enable-prefix-caching

This tells vLLM to cache the KV state of shared prompt prefixes and reuse them across requests. If your agentic workflow repeatedly sends the same 50K-token codebase context, prefix caching turns the second and subsequent requests into incremental fills. It has minimal benefit when every request has a unique prompt.
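The saving can be estimated by counting the tokens that have to be prefilled with and without a reusable prefix. This is an idealized sketch: the helper names and the workload numbers are assumptions, and it presumes the cached prefix is never evicted between steps.

```bash
#!/usr/bin/env bash
# Tokens prefilled across STEPS agent calls, each sending a shared PREFIX
# plus NEW tokens of fresh task text, with no caching.
prefill_without_cache() {
  local prefix=$1 new=$2 steps=$3
  echo $(( steps * (prefix + new) ))
}

# With prefix caching the shared prefix is computed once, then reused.
prefill_with_cache() {
  local prefix=$1 new=$2 steps=$3
  echo $(( prefix + steps * new ))
}

# ASSUMED workload: 50K-token codebase prefix, 2K new tokens per step, 10 steps.
echo "no cache:   $(prefill_without_cache 50000 2000 10) tokens"
echo "with cache: $(prefill_with_cache 50000 2000 10) tokens"
```

For this assumed workload the prefill volume drops from 520,000 to 70,000 tokens. When every request has a unique prompt the two functions converge, which is the minimal-benefit case noted above.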

FP8 KV Cache

--kv-cache-dtype fp8_e5m2 cuts KV memory in half with minimal quality impact. At 1M context, this flag is not optional; it is what makes 1M context viable on 8x H200.

NVMe KV Offloading

If you are running out of VRAM at peak batch concurrency, NVMe KV offloading lets vLLM spill overflow KV blocks to fast NVMe storage rather than returning OOM errors. See NVMe KV cache offloading for LLM inference for the setup details. This is a fallback, not a substitute for correct VRAM sizing.

GLM-5.2 shares attention optimizations with similar MoE architectures that reduce attention FLOPs by ~98% at 128K-1M context ranges, which directly impacts prefill latency on long coding context inputs.

Deploy GLM-5.2 with SGLang

SGLang works well for multi-turn agentic coding workflows. Its RadixAttention caches the KV state of shared prefixes across agent steps, which directly helps GLM-5.2's coding-first use case where agent steps often share a large codebase context window.

bash
pip install 'sglang[all]>=0.5.10'

python -m sglang.launch_server \
  --model-path ./glm5-2-fp8 \
  --tp 8 \
  --quantization fp8 \
  --ep-size 8 \
  --context-length 131072 \
  --port 30000

For 1M-context workloads:

bash
python -m sglang.launch_server \
  --model-path ./glm5-2-fp8 \
  --tp 8 \
  --quantization fp8 \
  --ep-size 8 \
  --context-length 1048576 \
  --mem-fraction-static 0.88 \
  --kv-cache-dtype fp8_e5m2 \
  --port 30000

--ep-size 8 turns on expert parallelism in SGLang, the counterpart of vLLM's --enable-expert-parallel. The flag's spelling has changed across SGLang releases (older builds used --enable-ep-moe), so confirm it with python -m sglang.launch_server --help on your installed version. Without expert parallelism, tensor parallelism slices every expert's weight matrices across all eight GPUs instead of placing whole experts on individual GPUs, which tends to be less efficient for fine-grained MoE layers.

RadixAttention is particularly useful for GLM-5.2 agentic coding use cases: when each agentic step sends the full codebase context plus the new task, RadixAttention keeps the codebase KV in cache and computes only the incremental portion. For a full production walkthrough, see the SGLang production deployment guide.

GLM-5.2 vs GLM-5.1 vs Kimi K2.7 Code vs DeepSeek V4: Performance and Cost

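| Model | Total Params | Active Params | FP8 VRAM | Context | Spheron Spot (8x H200) | Reasoning |
|---|---|---|---|---|---|---|
| GLM-5.1 | 754B | ~40B | ~754 GB | 200K | $14.56/hr | None |
| GLM-5.2 | 744B | ~40B | ~744 GB | 1M | $14.56/hr | High, Max |
| Kimi K2.7 Code | ~1T | ~32B | ~1,000 GB | 256K | ~$14.56/hr | None |
| DeepSeek V4 Pro | ~1.6T | ~49B | ~1,600 GB | 1M | ~$14.56/hr | None |

The spot column is the hourly rate for a single 8x H200 node (1,128 GB). DeepSeek V4 Pro's ~1,600 GB of FP8 weights do not fit in that capacity, and Kimi K2.7 Code's ~1,000 GB leaves under 130 GB for KV cache, so treat their figures as per-node rates rather than full FP8 deployment costs.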

GLM-5.2 wins over Kimi K2.7 Code when you need an MIT license with no usage restrictions, a 1M-token context window for long codebase ingestion, and Z.ai's reported superiority on coding benchmarks at similar hardware cost. Kimi K2.7 Code is a stronger pick for tasks where raw parameter scale matters and the Modified MIT license is acceptable. See the Kimi K2.7 Code deployment guide for its full hardware configuration.

GLM-5.2 vs DeepSeek V4 comes down to training focus: GLM-5.2 is tuned for coding specifically, while DeepSeek V4 is a general frontier model with 1M context. If your workload is primarily agentic coding rather than general reasoning or math, that is where Z.ai reports GLM-5.2's coding-first training pays off in benchmark results. See the DeepSeek V4 deployment guide for a side-by-side configuration comparison.

Cost-per-Token: Self-Hosted on Spheron vs Z.ai API

Hardware Cost Baseline

From current Spheron pricing (17 Jun 2026):

| Configuration | Spot | On-Demand |
|---|---|---|
| 8x H200 SXM5 (FP8, standard) | $14.56/hr | $38.72/hr |
| 8x B200 SXM6 (FP8, 1M context) | $21.68/hr | $59.28/hr |

Throughput and Cost-per-Million-Token

At realistic FP8 throughput for GLM-5.2 at 32K context, expect roughly 150-200 tokens/sec per 8-GPU node, a range representative of similar 744-754B MoE deployments. Using 175 tokens/sec as the midpoint:

cost_per_million = (hourly_rate / (tokens_per_second × 3600)) × 1_000_000

| Config | Rate | Cost/Million Tokens |
|---|---|---|
| 8x H200 spot | $14.56/hr | ~$23/M |
| 8x H200 on-demand | $38.72/hr | ~$61/M |
| 8x B200 spot | $21.68/hr | ~$34/M |
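The formula is easy to rerun as prices or measured throughput change. A minimal sketch using awk for the floating-point arithmetic; `cost_per_million` is a hypothetical helper, and 175 tokens/sec is the midpoint estimate from above rather than a measured value:

```bash
#!/usr/bin/env bash
# cost_per_million HOURLY_RATE TOKENS_PER_SECOND
# Dollars per one million generated tokens at a sustained throughput.
cost_per_million() {
  awk -v rate="$1" -v tps="$2" \
    'BEGIN { printf "%.2f\n", rate / (tps * 3600) * 1000000 }'
}

echo "8x H200 spot:      \$$(cost_per_million 14.56 175)/M"
echo "8x H200 on-demand: \$$(cost_per_million 38.72 175)/M"
echo "8x B200 spot:      \$$(cost_per_million 21.68 175)/M"
```

Replace 175 with the throughput you measure in your own benchmark run; the cost scales inversely with it.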

Z.ai's GLM-5.2 API is priced at roughly $1.40/M input tokens and $4.40/M output tokens. GPT-5.5 runs about $5/M input and $30/M output, making GLM-5.2's API roughly 7x cheaper on output for comparable coding quality.

Break-even analysis: At $14.56/hr on spot, running 24/7 costs roughly $10,483/month (720 hours) for an 8x H200 cluster. Against Z.ai's API at $4.40/M output tokens, the break-even sits at roughly 2.4B output tokens per month, which works out to about 920 tokens/sec sustained around the clock. That is roughly five times the 175 tokens/sec figure used above, so it is only reachable with high-concurrency batch workloads that push aggregate throughput well beyond that estimate. Teams that reach that scale will see meaningful savings from self-hosting on Spheron; below it, Z.ai's API is more cost-efficient than maintaining a dedicated cluster. Self-hosting also makes sense when data privacy or dedicated SLAs are requirements regardless of volume.
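The break-even figures can be recomputed for other rates or API prices. A sketch with hypothetical helper names, assuming a 30-day month of continuous operation and comparing against output-token pricing only:

```bash
#!/usr/bin/env bash
# monthly_cost HOURLY_RATE: cost of running 24/7 for a 30-day month.
monthly_cost() {
  awk -v rate="$1" 'BEGIN { printf "%.2f\n", rate * 24 * 30 }'
}

# break_even_mtok MONTHLY_COST API_PRICE_PER_M: millions of tokens per month
# at which the API bill equals the cluster bill.
break_even_mtok() {
  awk -v cost="$1" -v price="$2" 'BEGIN { printf "%.0f\n", cost / price }'
}

# required_tps MTOK_PER_MONTH: sustained tokens/sec needed to produce that
# volume over a 30-day month.
required_tps() {
  awk -v mtok="$1" 'BEGIN { printf "%.0f\n", mtok * 1000000 / (30 * 24 * 3600) }'
}

cost=$(monthly_cost 14.56)
mtok=$(break_even_mtok "$cost" 4.40)
echo "monthly cost:        \$${cost}"
echo "break-even volume:   ${mtok}M output tokens/month"
echo "required throughput: $(required_tps "$mtok") tokens/sec sustained"
```

For the 8x H200 spot rate this prints a monthly cost of $10483.20, a break-even volume of 2383M tokens, and a required throughput of 919 tokens/sec.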

Quantization Options

FP8

FP8 is the recommended production path. H200 SXM5 and B200 SXM6 have native FP8 tensor cores, so there is no software emulation overhead. Quality regression versus BF16 is minimal on most coding benchmarks. Z.ai publishes the official FP8 variant at zai-org/GLM-5.2-FP8, following the same convention as GLM-5.1. Verify the repo exists on Hugging Face before downloading, as weights may still be uploading shortly after the June 13 launch.

AWQ INT4

AWQ INT4 cuts VRAM from ~744 GB to ~372 GB, which lets you fit GLM-5.2 on 4x H200 at $7.28/hr spot instead of 8x H200 at $14.56/hr. The tradeoff is a 1-3% quality regression on coding benchmarks. Z.ai may not publish an official AWQ variant; if not, self-quantize using AutoAWQ from the base zai-org/GLM-5.2 weights. For a complete quantization workflow, see the AWQ quantization guide for LLM deployment.

Production Optimization

Batching and Throughput

  • Set --max-num-seqs to 2x your expected peak concurrent requests for standard workloads. For 1M-context, drop to 1-2 to avoid OOM.
  • Enable --enable-chunked-prefill to prevent long coding context prompts from starving shorter requests in the same batch.
  • Set --max-num-batched-tokens to at least 8,192 for good GPU utilization on standard workloads.

Standard continuous batching and paged attention tuning guidance applies here: increase --max-num-batched-tokens for throughput, decrease for latency.

Expert Parallelism

GLM-5.2 uses the same fine-grained MoE routing as GLM-5.1. With --enable-expert-parallel in vLLM or the expert-parallel option in SGLang (--ep-size in recent releases; verify with --help), the runtime places whole experts on individual GPUs and routes tokens to them over NVLink/NVSwitch. Without expert parallelism, tensor parallelism slices every expert's weight matrices across all GPUs, which tends to be less efficient for fine-grained MoE layers.

GLM-5.2's 40B-active expert routing benefits from DeepGEMM's FP8 grouped GEMM kernels for MoE efficiency. See the DeepEP and DeepGEMM setup guide for installation and integration steps. For a broader treatment of expert parallelism across major MoE architectures, see the MoE inference optimization guide.

KV Cache Tuning

  • FP8 KV cache (--kv-cache-dtype fp8_e5m2) is mandatory at 1M context and a good default for standard workloads too.
  • Set --max-model-len based on actual workload. For agentic coding agents, 32K-131K is usually the practical range; reserve 1M context for codebase ingestion tasks.
  • For overflow when context exceeds VRAM headroom, see the NVMe Offloading section above.

If VRAM headroom for 1M context is a constraint on 8x H200 today, the NVIDIA R100 (Rubin) pre-order page is worth watching. At 288 GB of HBM4 per chip, a 4-GPU R100 node (1,152 GB) would hold the full FP8 GLM-5.2 weight set where 8x H200 is required now, leaving roughly 408 GB for 1M-context KV cache at batch=2 or more.


GLM-5.2's 744B coding MoE is now MIT-licensed and ready to self-host. Spin up an 8x H200 spot cluster on Spheron and start serving 1M-context agentic coding requests at a fraction of API prices.

Quick Setup Guide

  1. Calculate VRAM requirements

    For FP8 weights (744B x 1 byte = 744 GB) plus KV cache overhead, provision 8x H200 SXM5 (1,128 GB total). For 1M-context workloads, budget an additional 80-160 GB for KV cache on top of weights. AWQ INT4 reduces weight memory to ~372 GB (4x H200 or 5x A100), but 1M context at INT4 still needs headroom planning.

  2. Provision a GPU node on Spheron

    Log in to app.spheron.ai, select an 8x H200 SXM5 or 8x B200 SXM6 spot instance, set storage to 800 GB minimum, and choose Ubuntu 22.04. SSH into the instance using the guide at docs.spheron.ai/connecting/ssh-connection.

  3. Install dependencies and download GLM-5.2 weights

    Install vLLM 0.19.0+: pip install 'vllm>=0.19.0' huggingface_hub. Download FP8 weights: huggingface-cli download zai-org/GLM-5.2-FP8 --local-dir ./glm5-2-fp8 --repo-type model. Approximately 800 GB of NVMe storage required.

  4. Launch the vLLM inference server

    Run: python -m vllm.entrypoints.openai.api_server --model ./glm5-2-fp8 --served-model-name glm5-2-fp8 --tensor-parallel-size 8 --quantization fp8 --enable-expert-parallel --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.92 --enable-chunked-prefill --max-num-seqs 32 --port 8000. For 1M-context workloads, raise --max-model-len to 1048576 and drop --max-num-seqs to 1 or 2 so the KV cache fits in the remaining headroom.

  5. Validate and benchmark

    Run: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "glm5-2-fp8", "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list."}]}'. Benchmark throughput with vllm's benchmark_serving.py before routing production traffic.

Frequently Asked Questions

What GPUs do I need to run GLM-5.2?

GLM-5.2 has 744B total parameters with ~40B active per forward pass (MoE). In FP8 (1 byte/param), weights consume roughly 744 GB, so you need 8x H200 SXM5 (1,128 GB total) for FP8 production. AWQ INT4 drops the footprint to ~372 GB, fitting on 4x H200 or a single 8x H100 node. For the 1M-token context window, budget an additional 80-160 GB for KV cache beyond the weight footprint, making 8x H200 or 8x B200 the practical minimum for 1M-context workloads.

What changed between GLM-5.1 and GLM-5.2?

GLM-5.2 reduces total parameter count from 754B to 744B, introduces dual reasoning modes (High and Max), expands the usable context from 200K to 1M tokens, and sharpens the training focus on agentic coding tasks. The MIT license is unchanged. Active parameter count remains ~40B per forward pass, so VRAM for weights is marginally lower but KV cache at 1M context is the new binding constraint.

Can I serve GLM-5.2 with vLLM?

Yes. vLLM 0.19.0+ supports GLM-family MoE models. Use --tensor-parallel-size 8, --enable-expert-parallel, --quantization fp8, --max-model-len 131072 (or up to 1048576 for 1M context), --kv-cache-dtype fp8_e5m2, and --enable-chunked-prefill. For 1M-context workloads you will need chunked prefill enabled plus sufficient KV cache VRAM; 8x H200 or 8x B200 is recommended.

How does GLM-5.2's coding performance compare with GLM-5.1?

GLM-5.2 targets a higher coding performance ceiling than GLM-5.1 with its coding-first training focus and dual reasoning modes. Z.ai reports it beats GPT-5.5 on coding at roughly one-sixth the API cost. The 744B vs 754B parameter difference is architecturally minor; the main improvements are in agentic coding capabilities and the 1M-token context window.

What is the most cost-effective way to self-host GLM-5.2?

FP8 on 8x H200 SXM5 spot instances on Spheron gives the best cost-per-token for high-volume workloads. For lower concurrency where the 1M context window is not needed, AWQ INT4 on 4x H200 halves the hourly rate relative to 8x H200 at the same pricing tier ($7.28/hr spot, $19.36/hr on-demand). Prices fluctuate, so check the live rates at spheron.network/pricing/ before sizing your cluster.
