Deploy GLM-5.2 on GPU Cloud: VRAM, Cost & vLLM Setup (2026)

GLM-5.2 shipped on June 13, 2026. Z.ai announced the weights would be released under MIT license, with a staggered open-weight rollout scheduled for the days following launch. The model has a 744B MoE architecture that keeps the same ~40B active parameter count as GLM-5.1 but adds two things that matter for production: a 1M-token context window and a coding-first training focus that Z.ai claims beats GPT-5.5 on coding at roughly one-sixth the API cost. This guide covers the full deployment path: VRAM sizing for both standard and 1M-context workloads, vLLM and SGLang configuration, and cost math using live Spheron pricing. For the per-quant VRAM numbers on their own, the GLM-5.2 VRAM requirements page keeps them paired with live GPU rates.

If you are coming from the GLM-5.1 deployment guide, the hardware setup is nearly identical with one important difference: 1M-context workloads require FP8 KV cache and leave less headroom on 8x H200, so node sizing needs careful attention.

What's New in GLM-5.2 vs GLM-5.1

Feature	GLM-5.1	GLM-5.2
Total parameters	754B	744B
Active parameters	~40B	~40B
Context window	200K tokens	1M tokens
Reasoning modes	None	High, Max
Training focus	Mixed coding + reasoning	Coding-first
License	MIT	MIT
Release date	April 7, 2026	June 13, 2026

The parameter reduction from 754B to 744B is minor and does not materially change VRAM requirements. The meaningful upgrades are the 1M-token context and the dual reasoning modes.

High vs Max reasoning modes in practice: High mode produces responses with lighter chain-of-thought, which means shorter sequences and lower per-token latency. Max mode activates deeper reasoning chains for harder agentic tasks. At inference time, this affects token budget, not GPU configuration. High mode is the right default for interactive coding agents; Max mode makes sense for batch agentic pipelines where latency is less critical than quality on hard problems. Reasoning mode names are per Z.ai's June 13, 2026 release notes; verify against the official model card if Z.ai updates the naming after this guide was written.

GPU Memory Requirements

The same MoE misconception that applied to GLM-5.1 applies here: only ~40B parameters activate per forward pass, but all 744B must live in GPU memory. Expert routing cannot page weights in and out at inference time without prohibitive latency.

Weight memory by quantization:

BF16: 744B x 2 bytes = ~1,488 GB. Requires multi-node setup.
FP8: 744B x 1 byte = ~744 GB. Fits on 8x H200 SXM5 with headroom.
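AWQ INT4: 744B x 0.5 bytes = ~372 GB, plus 5-10% for activations (roughly 390-410 GB in total). Fits comfortably on 4x H200; 5x A100 80GB (400 GB) only fits at the low end of that overhead range and leaves almost nothing for KV cache.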
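The arithmetic in this list can be sketched in a few lines of Python. This is a minimal sizing helper under stated assumptions, not an official calculator: the function names are illustrative, the default 10% overhead is the upper end of the 5-10% range quoted for AWQ INT4, and the GPU count is a lower bound that ignores KV cache and tensor-parallel divisibility.

```python
import math

PARAMS_BILLION = 744  # total parameters quoted in this guide

# Bytes per parameter for each weight format discussed in this section.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "awq_int4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB: 1B parameters at 1 byte each is ~1 GB."""
    return params_billion * BYTES_PER_PARAM[fmt]


def min_gpus(weights_gb: float, vram_per_gpu_gb: float, overhead: float = 0.10) -> int:
    """Smallest GPU count whose combined VRAM holds weights plus overhead.

    Lower bound only: it ignores KV cache, and real deployments usually
    round up to a tensor-parallel-friendly count such as 4 or 8.
    """
    return math.ceil(weights_gb * (1 + overhead) / vram_per_gpu_gb)


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        gb = weight_gb(PARAMS_BILLION, fmt)
        print(f"{fmt}: ~{gb:,.0f} GB, at least {min_gpus(gb, 141)} x 141 GB GPUs")
```

With the 10% default, AWQ INT4 on 80 GB cards comes out at six GPUs rather than five; five cards only suffice when the overhead is assumed to be 5%, which is why the 80 GB option is tight.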

For VRAM planning across other models, the GPU requirements cheat sheet for 2026 covers the memory math in a single lookup table.

1M-Context KV Cache Sizing

The 1M-token context window changes the VRAM picture significantly. KV cache memory grows linearly with sequence length:

kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
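As a rough illustration, the formula translates directly into Python. The sketch below assumes standard multi-head or grouped-query attention; models with compressed or sparse KV layouts will not follow it exactly. The layer count, KV head count, and head dimension are placeholder values chosen for easy arithmetic, not values read from GLM-5.2's config.json, and the `batch_size` multiplier is an addition for convenience (the formula above is per sequence).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element, batch_size=1):
    """KV cache size in bytes; the leading 2 covers keys plus values."""
    per_sequence = (2 * num_layers * num_kv_heads * head_dim
                    * seq_len * bytes_per_element)
    return per_sequence * batch_size


# Placeholder architecture values (hypothetical, NOT GLM-5.2's real config).
EXAMPLE_ARCH = dict(num_layers=40, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for label, nbytes in (("FP16", 2), ("FP8", 1)):
        size = kv_cache_bytes(seq_len=1_048_576, bytes_per_element=nbytes,
                              **EXAMPLE_ARCH)
        print(f"{label} KV, 1M tokens, batch=1: {size / 2**30:.0f} GiB")
```

Substitute the real values from the model's config.json once the weights are published; the linear dependence on `seq_len` and `bytes_per_element` is the part that carries over.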

For GLM-5.2 at 1M tokens (1,048,576 sequence length), approximate figures based on the model's MoE architecture:

KV Precision	Seq Length	Batch Size	Approx KV Cache
FP16	131,072 (128K)	4	~80 GB
FP16	1,048,576 (1M)	1	~160 GB
FP8	1,048,576 (1M)	1	~80 GB
FP8	1,048,576 (1M)	4	~320 GB by linear scaling; requires OOM management

FP8 KV cache (--kv-cache-dtype fp8_e5m2) is mandatory for 1M-context workloads. With 744 GB of FP8 weights loaded on 8x H200 (1,128 GB total), roughly 384 GB of headroom remains, and that is before activation memory and the --gpu-memory-utilization cap are subtracted. At FP16, a single 1M-token sequence already consumes ~160 GB of that headroom; FP8 KV halves it to ~80 GB. For 1M-context serving, also set --max-num-seqs 1 or --max-num-seqs 2 to avoid OOM at peak concurrency.
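The headroom reasoning can be checked with a short calculation. The sketch below uses only the approximate figures quoted in this section (1,128 GB total, 744 GB of weights, ~160 GB and ~80 GB per 1M-token sequence); the function names are illustrative, and activation memory and framework overhead are ignored, so the results are upper bounds.

```python
TOTAL_VRAM_GB = 8 * 141   # 8x H200 SXM5
WEIGHTS_GB = 744          # FP8 weights
# Approximate KV cache per 1M-token sequence, from the table above.
KV_PER_1M_SEQ_GB = {"fp16": 160, "fp8": 80}


def kv_budget_gb(total_gb, weights_gb, gpu_memory_utilization=1.0):
    """VRAM left for KV cache after weights, under a utilization cap."""
    return total_gb * gpu_memory_utilization - weights_gb


def max_1m_sequences(kv_dtype, budget_gb):
    """Number of full 1M-token sequences that fit in the KV budget."""
    return int(budget_gb // KV_PER_1M_SEQ_GB[kv_dtype])


if __name__ == "__main__":
    for util in (1.0, 0.92):
        budget = kv_budget_gb(TOTAL_VRAM_GB, WEIGHTS_GB, util)
        for dtype in KV_PER_1M_SEQ_GB:
            n = max_1m_sequences(dtype, budget)
            print(f"util={util}: {budget:.0f} GB budget, {dtype} fits {n} x 1M")
```

At the 0.92 utilization used in the launch commands, the budget drops from 384 GB to about 294 GB, which is one FP16 sequence or three FP8 sequences before any other overhead. That margin is why a low --max-num-seqs is advised.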

For a deep dive on KV cache memory math and optimization strategies, see the KV cache optimization guide.

GPU configurations that work for GLM-5.2:

GPU	VRAM per card	Count for FP8	Count for AWQ INT4
H200 SXM5	141 GB	8x (1,128 GB)	4x (564 GB)
B200 SXM6	192 GB	8x (1,536 GB)	4x (768 GB)
H100 SXM5	80 GB	10x (800 GB)	5x (400 GB)
A100 80GB SXM4	80 GB	10x (800 GB)	5x (400 GB)

B200 is particularly well-suited for 1M-context workloads: its higher HBM3e bandwidth reduces time-to-first-token at long sequence prefill, and 8x B200 (1,536 GB) gives significantly more KV headroom than 8x H200 (1,128 GB). See available Spheron B200 instances for current spot pricing.

Spheron GPU Configurations and Pricing

Pricing fetched from the Spheron API on 06 Jul 2026:

Configuration	Precision	Total VRAM	Spot Price	On-Demand Price
8x H200 SXM5	FP8	1,128 GB	$26.48/hr (8 x $3.31)	$29.60/hr (8 x $3.70)
4x H200 SXM5	AWQ INT4	564 GB	$13.24/hr (4 x $3.31)	$14.80/hr (4 x $3.70)
8x B200 SXM6	FP8, 1M context	1,536 GB	$42.72/hr (8 x $5.34)	$74.88/hr (8 x $9.36)

Pricing fluctuates based on GPU availability. The prices above are based on 06 Jul 2026 and may have changed. Check current GPU pricing → for live rates.

The 8x H200 SXM5 FP8 spot configuration at $26.48/hr is the cost-optimal entry point for production, though the spot discount versus on-demand ($29.60/hr) has narrowed a lot since June, so weigh the reclaim risk before defaulting to spot. For 1M-context workloads where you need the KV headroom, 8x B200 at $42.72/hr spot gives 408 GB more total VRAM and better prefill throughput at long sequences.
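To put a number on the spot-versus-on-demand gap, the per-GPU rates in the table can be compared directly. This is a small sketch using the 06 Jul 2026 rates quoted above; the helper name is illustrative and the rates will drift.

```python
def spot_discount(spot: float, on_demand: float) -> float:
    """Fractional saving of spot relative to on-demand pricing."""
    return (on_demand - spot) / on_demand


# Per-GPU $/hr as (spot, on_demand), from the pricing table above.
RATES = {
    "H200 SXM5": (3.31, 3.70),
    "B200 SXM6": (5.34, 9.36),
}

if __name__ == "__main__":
    for gpu, (spot, on_demand) in RATES.items():
        node_spot = 8 * spot
        print(f"8x {gpu}: ${node_spot:.2f}/hr spot, "
              f"{spot_discount(spot, on_demand):.1%} below on-demand")
```

On these rates the H200 spot discount is about 10.5%, while B200 spot is about 43% below on-demand, so the reclaim-risk tradeoff looks quite different for the two node types.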

Step-by-Step Deployment with vLLM

Step 1: Provision the GPU Node on Spheron

Log in to app.spheron.ai and select an 8x H200 SXM5 or 8x B200 SXM6 instance. For standard workloads, spot is the right choice. For latency-sensitive production serving, use on-demand. SSH into the instance once it provisions (see the SSH connection guide).

Set storage to at least 800 GB NVMe for the FP8 weights.

Step 2: Install Dependencies

bash

# CUDA 12.4+ required
pip install "vllm>=0.19.0" huggingface_hub
export HF_TOKEN=<your_token>

Step 3: Download Model Weights

bash

# FP8 variant (~800 GB on NVMe)
huggingface-cli download zai-org/GLM-5.2-FP8 \
  --local-dir ./glm5-2-fp8 \
  --repo-type model

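The weights follow Z.ai's naming convention from GLM-5.1 (zai-org/GLM-5.1-FP8). Verify the repo exists at Hugging Face before running, as weights may still be uploading in the days following the June 13 launch. If the FP8 repo is not yet available, use the base zai-org/GLM-5.2 weights and let vLLM quantize them to FP8 at load time with --quantization fp8 (the BF16 download is roughly twice the size, so provision storage accordingly), or produce AWQ INT4 via AutoAWQ. Note that --load-format auto only selects the checkpoint file format; it does not quantize.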

Step 4: Launch the vLLM Server

For standard workloads (up to 128K context):

bash

python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-2-fp8 \
  --served-model-name glm5-2-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --port 8000

For 1M-context workloads (requires 8x H200 or 8x B200 with FP8 KV):

bash

python -m vllm.entrypoints.openai.api_server \
  --model ./glm5-2-fp8 \
  --served-model-name glm5-2-fp8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 1048576 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 1 \
  --port 8000

Flag explanations:

Flag	Purpose
`--tensor-parallel-size 8`	Splits model shards across 8 GPUs
`--enable-expert-parallel`	Distributes MoE experts across GPUs for parallel routing
`--kv-cache-dtype fp8_e5m2`	Halves KV cache memory; mandatory for 1M-context
`--enable-chunked-prefill`	Essential at 1M context to prevent single long prompts from blocking the batch
`--max-num-seqs 1`	Required for 1M-context to avoid OOM at peak concurrency
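The two launch profiles differ in only a few flags, which a small helper can make explicit. The sketch below is an illustration rather than part of vLLM: `build_vllm_args` is a hypothetical function, the 131,072-token threshold simply mirrors the two profiles in this guide, and the per-step token budget is passed as --max-num-batched-tokens, the flag name used in the Batching and Throughput section.

```python
import shlex


def build_vllm_args(model_path: str, max_model_len: int,
                    tensor_parallel_size: int = 8) -> list[str]:
    """Assemble the vLLM launch command for a standard or 1M-context profile."""
    long_context = max_model_len > 131_072  # assumption: mirrors this guide
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--served-model-name", "glm5-2-fp8",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--quantization", "fp8",
        "--enable-expert-parallel",
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", "fp8_e5m2",
        "--gpu-memory-utilization", "0.92",
        "--enable-chunked-prefill",
        # One sequence at 1M context to avoid OOM; 32 for standard serving.
        "--max-num-seqs", "1" if long_context else "32",
        "--port", "8000",
    ]
    if long_context:
        # Cap the tokens processed per scheduler step (the prefill chunk size).
        args += ["--max-num-batched-tokens", "32768"]
    return args


if __name__ == "__main__":
    print(shlex.join(build_vllm_args("./glm5-2-fp8", 1_048_576)))
```

Generating the command from one function keeps the standard and 1M-context profiles from drifting apart when a shared flag changes.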

Note: in some vLLM v0.19.x minor versions the expert parallelism flag may be --moe-expert-parallel-size instead of --enable-expert-parallel. Check your installed version's release notes if the flag is not recognized.

See vLLM vs TensorRT-LLM vs SGLang benchmarks for a detailed throughput comparison across serving frameworks.

Step 5: Validate

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm5-2-fp8",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list."}],
    "max_tokens": 512
  }'

Benchmark throughput with vLLM's benchmark_serving.py (exposed as the vllm bench serve subcommand in recent releases) before routing production traffic.
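The same validation request can be issued from Python using only the standard library, which is convenient inside a health-check script. This is a minimal sketch: the helper names are illustrative, and the response parsing assumes the OpenAI-compatible chat completions schema that the endpoint above exposes.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(request: urllib.request.Request, timeout: float = 600) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server from Step 4 to be running):
# req = build_chat_request("http://localhost:8000", "glm5-2-fp8",
#                          "Write a Python function to reverse a linked list.")
# print(first_reply(req))
```

The generous timeout is deliberate: the first request after startup can be slow while the server warms up.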

Serving GLM-5.2's 1M-Token Context Window

The 1M-token context window is the feature that separates GLM-5.2 from GLM-5.1 for agentic coding workloads. Serving it requires more than just increasing --max-model-len. Here is what actually needs tuning.

Chunked Prefill

At 1M tokens, a single prefill call processes one million tokens before generating a single output token. Without chunked prefill, that one prompt monopolizes the engine for the full duration of its prefill and stalls every other request in the batch. With chunked prefill enabled, vLLM processes the long prompt in chunks while interleaving other requests:

bash

--enable-chunked-prefill
--max-num-batched-tokens 32768

Set --max-num-batched-tokens, which caps the tokens processed per scheduler step and therefore the prefill chunk size, to a value your GPU can handle without latency spikes on shorter requests. 32,768 is a good starting point; increase if throughput is the bottleneck, decrease if short-request latency degrades.
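To get a feel for what a given chunk size means, the number of scheduler steps needed to prefill one prompt is a ceiling division. This is a back-of-the-envelope sketch with an illustrative function name; in practice decode tokens from other requests share the same per-step budget, so real step counts can be higher.

```python
import math


def prefill_steps(prompt_tokens: int, tokens_per_step: int) -> int:
    """Minimum scheduler steps needed to prefill a prompt in chunks."""
    return math.ceil(prompt_tokens / tokens_per_step)


if __name__ == "__main__":
    for budget in (8_192, 32_768):
        steps = prefill_steps(1_048_576, budget)
        print(f"budget={budget}: {steps} steps for a 1M-token prompt")
```

A 1M-token prompt takes 32 steps at a 32,768-token budget and 128 steps at 8,192, which shows the tradeoff: smaller chunks give other requests more opportunities to run, at the price of more steps for the long prompt.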

Prefix Caching

For agentic coding workflows where multiple agent steps share a long system prompt or a large codebase context, prefix caching reduces TTFT on subsequent calls:

bash

--enable-prefix-caching

This tells vLLM to cache the KV state of shared prompt prefixes and reuse them across requests. If your agentic workflow repeatedly sends the same 50K-token codebase context, prefix caching turns the second and subsequent requests into incremental fills. It has minimal benefit when every request has a unique prompt.
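The effect is easier to see with a toy model. The sketch below is a deliberately simplified illustration of block-level prefix matching, not vLLM's actual implementation: the block size, the chained SHA-256 hashing, and the `ToyPrefixCache` class are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; illustrative value


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Remembers block hashes and reports how much of a prompt is reusable."""

    def __init__(self):
        self._seen = set()

    def admit(self, token_ids):
        """Return (cached_tokens, tokens_to_prefill) and store this prompt's blocks."""
        hashes = block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h not in self._seen:
                break  # the first miss ends the reusable prefix
            hit_blocks += 1
        self._seen.update(hashes)
        cached = hit_blocks * BLOCK_SIZE
        return cached, len(token_ids) - cached


if __name__ == "__main__":
    cache = ToyPrefixCache()
    codebase = list(range(50_000))           # stand-in for a shared 50K-token context
    print(cache.admit(codebase + [1, 2, 3]))     # first call: nothing cached
    print(cache.admit(codebase + [9, 8, 7, 6]))  # second call: only the tail is new
```

Because each hash covers the entire prefix before it, two prompts only share cached blocks up to the first token where they diverge, which is why unique prompts see no benefit.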

FP8 KV Cache

--kv-cache-dtype fp8_e5m2 cuts KV memory in half with minimal quality impact. At 1M context, this flag is not optional; it is what makes 1M context viable on 8x H200.

NVMe KV Offloading

If you are running out of VRAM at peak batch concurrency, NVMe KV offloading lets vLLM spill overflow KV blocks to fast NVMe storage rather than returning OOM errors. See NVMe KV cache offloading for LLM inference for the setup details. This is a fallback, not a substitute for correct VRAM sizing.

GLM-5.2 shares attention optimizations with similar MoE architectures that reduce attention FLOPs by ~98% at 128K-1M context ranges, which directly impacts prefill latency on long coding context inputs.

Deploy GLM-5.2 with SGLang

SGLang works well for multi-turn agentic coding workflows. Its RadixAttention caches the KV state of shared prefixes across agent steps, which directly helps GLM-5.2's coding-first use case where agent steps often share a large codebase context window.

bash

pip install 'sglang[all]>=0.5.10'

python -m sglang.launch_server \
  --model-path ./glm5-2-fp8 \
  --tp 8 \
  --quantization fp8 \
  --enable-moe-ep \
  --context-length 131072 \
  --port 30000

For 1M-context workloads:

bash

python -m sglang.launch_server \
  --model-path ./glm5-2-fp8 \
  --tp 8 \
  --quantization fp8 \
  --enable-moe-ep \
  --context-length 1048576 \
  --mem-fraction-static 0.88 \
  --kv-cache-dtype fp8_e5m2 \
  --port 30000

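--enable-moe-ep is SGLang's expert parallelism flag, equivalent to vLLM's --enable-expert-parallel. SGLang has renamed its expert-parallel options across releases (earlier versions used --enable-ep-moe, later ones --ep-size), so confirm the spelling with python -m sglang.launch_server --help on your installed version. Without expert parallelism, tensor parallelism slices every expert's weight matrices across all GPUs instead of placing whole experts on individual GPUs, which leaves each GPU with small matrix multiplies and lowers effective throughput on fine-grained MoE layers.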

RadixAttention is particularly useful for GLM-5.2 agentic coding use cases: when each agentic step sends the full codebase context plus the new task, RadixAttention keeps the codebase KV in cache and computes only the incremental portion. For a full production walkthrough, see the SGLang production deployment guide.

GLM-5.2 vs GLM-5.1 vs Kimi K2.7 Code vs DeepSeek V4: Performance and Cost

Model	Total Params	Active Params	FP8 VRAM	Context	Spheron Spot (8x H200)	Reasoning
GLM-5.1	754B	~40B	~754 GB	200K	$26.48/hr	None
GLM-5.2	744B	~40B	~744 GB	1M	$26.48/hr	High, Max
Kimi K2.7 Code	~1T	~32B	~1,000 GB	256K	~$26.48/hr	None
DeepSeek V4 Pro	~1.6T	~49B	~1,600 GB	1M	~$26.48/hr per node (FP8 weights exceed one 1,128 GB node)	None

GLM-5.2 wins over Kimi K2.7 Code when you need an MIT license with no usage restrictions, a 1M-token context window for long codebase ingestion, and Z.ai's reported superiority on coding benchmarks at similar hardware cost. Kimi K2.7 Code is a stronger pick for tasks where raw parameter scale matters and the Modified MIT license is acceptable. See the Kimi K2.7 Code deployment guide for its full hardware configuration.

GLM-5.2 vs DeepSeek V4 comes down to training focus: GLM-5.2 is tuned for coding specifically, while DeepSeek V4 is a general frontier model with 1M context. If your workload is primarily agentic coding and not general reasoning or math, that is where Z.ai reports GLM-5.2's coding-first training paying off in benchmark gains. See the DeepSeek V4 deployment guide for a side-by-side configuration comparison.

Cost-per-Token: Self-Hosted on Spheron vs Z.ai API

Hardware Cost Baseline

From current Spheron pricing (06 Jul 2026):

Configuration	Spot	On-Demand
8x H200 SXM5 (FP8, standard)	$26.48/hr	$29.60/hr
8x B200 SXM6 (FP8, 1M context)	$42.72/hr	$74.88/hr

Throughput and Cost-per-Million-Token

At realistic FP8 throughput for GLM-5.2 at 32K context, expect roughly 150-200 tokens/sec per 8-GPU node, a range representative of similar 744-754B MoE deployments. Using 175 tokens/sec as the midpoint:

cost_per_million = (hourly_rate / (tokens_per_second × 3600)) × 1_000_000

Config	Rate	Cost/Million Tokens
8x H200 spot	$26.48/hr	~$42/M
8x H200 on-demand	$29.60/hr	~$47/M
8x B200 spot	$42.72/hr	~$68/M
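The table values follow from the formula above and can be reproduced in a few lines. This sketch uses the hourly rates from the pricing table and the 175 tokens/sec midpoint; the function is a direct transcription of the formula, and the throughput figure remains an estimate.

```python
def cost_per_million(hourly_rate: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


# Hourly node rates from the pricing table (06 Jul 2026).
CONFIGS = {
    "8x H200 spot": 26.48,
    "8x H200 on-demand": 29.60,
    "8x B200 spot": 42.72,
}

if __name__ == "__main__":
    for name, rate in CONFIGS.items():
        print(f"{name}: ${cost_per_million(rate, 175):.2f}/M tokens")
```

Cost scales inversely with throughput, so rerunning the function with your own measured tokens/sec matters more than the exact hourly rate.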

Z.ai's GLM-5.2 API is priced at roughly $1.40/M input tokens and $4.40/M output tokens. GPT-5.5 runs about $5/M input and $30/M output, making GLM-5.2's API roughly 7x cheaper on output for comparable coding quality.

Break-even analysis: At $26.48/hr on spot, running 24/7 costs roughly $19,066/month (720 hours) for an 8x H200 cluster. Against Z.ai's API at $4.40/M output tokens, the break-even sits at roughly 4.3B output tokens per month. Spread across a 30-day month, that is about 1,670 output tokens/sec sustained, close to 10x the 175 tokens/sec figure used above, so reaching it requires sustained high-concurrency batch workloads. Teams that reach that scale will see meaningful savings from self-hosting on Spheron; below it, Z.ai's API is more cost-efficient than maintaining a dedicated cluster. Self-hosting also makes sense when data privacy or dedicated SLAs are requirements regardless of volume.
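The break-even arithmetic can be written out step by step. The sketch below assumes a 30-day month and uses the spot rate and API output price quoted in this section; the function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def monthly_cost(hourly_rate: float) -> float:
    """Cost of running one node around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH


def break_even_tokens(hourly_rate: float, api_price_per_million: float) -> float:
    """Monthly output tokens at which self-hosting matches the API bill."""
    return monthly_cost(hourly_rate) / api_price_per_million * 1_000_000


def required_tokens_per_second(tokens_per_month: float) -> float:
    """Sustained throughput needed to produce that many tokens in a month."""
    return tokens_per_month / (HOURS_PER_MONTH * 3600)


if __name__ == "__main__":
    tokens = break_even_tokens(26.48, 4.40)
    print(f"monthly cost: ${monthly_cost(26.48):,.0f}")
    print(f"break-even:   {tokens / 1e9:.2f}B output tokens/month")
    print(f"needs:        {required_tokens_per_second(tokens):,.0f} tokens/sec sustained")
```

The last figure is the useful one for planning: compare it against the aggregate throughput you actually measure under load before committing to a dedicated cluster.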

Pricing fluctuates based on GPU availability. The prices above are based on 06 Jul 2026 and may have changed. Check current GPU pricing at spheron.network/pricing/ for live rates.

Quantization Options

FP8

FP8 is the recommended production path. H200 SXM5 and B200 SXM6 have native FP8 tensor cores, so there is no software emulation overhead. Quality regression versus BF16 is minimal on most coding benchmarks. Z.ai publishes the official FP8 variant at zai-org/GLM-5.2-FP8, following the same convention as GLM-5.1. Verify the repo exists on Hugging Face before downloading, as weights may still be uploading shortly after the June 13 launch.

AWQ INT4

AWQ INT4 cuts VRAM from ~744 GB to ~372 GB, which lets you fit GLM-5.2 on 4x H200 at $13.24/hr spot instead of 8x H200 at $26.48/hr. The tradeoff is a 1-3% quality regression on coding benchmarks. Z.ai may not publish an official AWQ variant; if not, self-quantize using AutoAWQ from the base zai-org/GLM-5.2 weights. For a complete quantization workflow, see the AWQ quantization guide for LLM deployment.

Production Optimization

Batching and Throughput

Set --max-num-seqs to 2x your expected peak concurrent requests for standard workloads. For 1M-context, drop to 1-2 to avoid OOM.
Enable --enable-chunked-prefill to prevent long coding context prompts from starving shorter requests in the same batch.
Set --max-num-batched-tokens to at least 8,192 for good GPU utilization on standard workloads.

Standard continuous batching and paged attention tuning guidance applies here: increase --max-num-batched-tokens for throughput, decrease for latency.

Expert Parallelism

GLM-5.2 uses the same fine-grained MoE routing as GLM-5.1. With --enable-expert-parallel in vLLM or --enable-moe-ep in SGLang, the runtime places whole experts on individual GPUs and routes tokens to them over NVLink/NVSwitch. Without expert parallelism, tensor parallelism slices every expert across all GPUs, which leaves each GPU with small, inefficient matrix multiplies for fine-grained experts.

GLM-5.2's 40B-active expert routing benefits from DeepGEMM's FP8 grouped GEMM kernels for MoE efficiency. See the DeepEP and DeepGEMM setup guide for installation and integration steps. For a broader treatment of expert parallelism across major MoE architectures, see the MoE inference optimization guide.

KV Cache Tuning

FP8 KV cache (--kv-cache-dtype fp8_e5m2) is mandatory at 1M context and a good default for standard workloads too.
Set --max-model-len based on actual workload. For agentic coding agents, 32K-131K is usually the practical range; reserve 1M context for codebase ingestion tasks.
For overflow when context exceeds VRAM headroom, see the NVMe KV Offloading section above.

GLM-5.2 vs Newer Coding Models (Late 2026)

GLM-5.2 launched in June 2026, and several coding-focused models have shipped since. Here is how it holds up if you are choosing what to self-host today.

Qwen3-Coder-Next is strong on agentic tool use and cheaper to serve at INT4, but GLM-5.2 still leads on long-context codebase reasoning thanks to its usable 1M-token window. See the Qwen3-Coder-Next deployment guide.
DeepSeek V4-Pro matches GLM-5.2 on many coding benchmarks and has a larger community, but its weight footprint is heavier per active token. The DeepSeek V4-Pro deployment guide walks through sizing.
API-only frontier models score higher on a few coding leaderboards, but at 1M-context volume the self-hosted GLM-5.2 economics still win. The Claude Opus 4.8 API vs self-hosted cost breakdown shows where the crossover sits.

The practical takeaway: GLM-5.2 stays a strong pick when you need MIT-licensed weights, a real 1M-token context, and predictable cost. If your batch sizes push 8x H200 to its VRAM limit, the extra headroom on Spheron's B300 instances at 288 GB per GPU lets you run FP8 GLM-5.2 with more room for KV cache.

If VRAM headroom for 1M context is a constraint on 8x H200 today, the NVIDIA R100 (Rubin) pre-order page is worth watching. At 288 GB HBM4 per chip, a 4-GPU R100 node (1,152 GB) would hold the full FP8 GLM-5.2 weight set where 8x H200 is required now, with roughly 408 GB remaining for 1M-context KV cache at batch=2 or more.

GLM-5.2's 744B coding MoE is now MIT-licensed and ready to self-host. Spin up an 8x H200 spot cluster on Spheron and start serving 1M-context agentic coding requests at a fraction of API prices.


Quick Setup Guide

Calculate VRAM requirements
For FP8 weights (744B x 1 byte = 744 GB) plus KV cache overhead, provision 8x H200 SXM5 (1,128 GB total). For 1M-context workloads, budget an additional 80-160 GB for KV cache on top of weights. AWQ INT4 reduces weight memory to ~372 GB (4x H200 or 5x A100), but 1M context at INT4 still needs headroom planning.
Provision a GPU node on Spheron
Log in to app.spheron.ai, select an 8x H200 SXM5 or 8x B200 SXM6 spot instance, set storage to 800 GB minimum, and choose Ubuntu 22.04. SSH into the instance using the guide at docs.spheron.ai/connecting/ssh-connection.
Install dependencies and download GLM-5.2 weights
Install vLLM 0.19.0+: pip install 'vllm>=0.19.0' huggingface_hub. Download FP8 weights: huggingface-cli download zai-org/GLM-5.2-FP8 --local-dir ./glm5-2-fp8 --repo-type model. Approximately 800 GB of NVMe storage required.
Launch the vLLM inference server
Run: python -m vllm.entrypoints.openai.api_server --model ./glm5-2-fp8 --served-model-name glm5-2-fp8 --tensor-parallel-size 8 --quantization fp8 --enable-expert-parallel --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.92 --enable-chunked-prefill --max-num-seqs 32 --port 8000. Increase --max-model-len up to 1048576 for 1M-context workloads but ensure KV cache headroom.
Validate and benchmark
Run: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "glm5-2-fp8", "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list."}]}'. Benchmark throughput with vLLM's benchmark_serving.py before routing production traffic.

Frequently Asked Questions

How much VRAM does GLM-5.2 need?

GLM-5.2 has 744B total parameters with ~40B active per forward pass (MoE). In FP8 (1 byte/param), weights consume roughly 744 GB, so you need 8x H200 SXM5 (1,128 GB total) for FP8 production. AWQ INT4 drops the footprint to ~372 GB, fitting on 4x H200 or a single 8x H100 node. For the 1M-token context window, budget an additional 80-160 GB for KV cache beyond the weight footprint, making 8x H200 or 8x B200 the practical minimum for 1M-context workloads.

What changed between GLM-5.1 and GLM-5.2?

GLM-5.2 reduces total parameter count from 754B to 744B, introduces dual reasoning modes (High and Max), expands the usable context from 200K to 1M tokens, and sharpens the training focus on agentic coding tasks. The MIT license is unchanged. Active parameter count remains ~40B per forward pass, so VRAM for weights is marginally lower but KV cache at 1M context is the new binding constraint.

Can I run GLM-5.2 with vLLM?

Yes. vLLM 0.19.0+ supports GLM-family MoE models. Use --tensor-parallel-size 8, --enable-expert-parallel, --quantization fp8, --max-model-len 131072 (or up to 1048576 for 1M context), --kv-cache-dtype fp8_e5m2, and --enable-chunked-prefill. For 1M-context workloads you will need chunked prefill enabled plus sufficient KV cache VRAM - 8x H200 or 8x B200 is recommended.

How does GLM-5.2 compare to GLM-5.1 on coding benchmarks?

GLM-5.2 targets a higher coding performance ceiling than GLM-5.1 with its coding-first training focus and dual reasoning modes. Z.ai reports it beats GPT-5.5 on coding at roughly one-sixth the API cost. The 744B vs 754B parameter difference is architecturally minor; the main improvements are in agentic coding capabilities and the 1M-token context window.

What is the cheapest way to serve GLM-5.2 in production?

FP8 on 8x H200 SXM5 spot instances on Spheron gives the best cost-per-token for high-volume workloads. For lower concurrency where the 1M context window is not needed, AWQ INT4 on 4x H200 on-demand is more cost-effective. Prices fluctuate - check the live rates at spheron.network/pricing/ before sizing your cluster.