Comparison

Modular MAX and Mojo on GPU Cloud: Deploy an LLM Inference Engine That Outperforms vLLM (2026 Guide)

Written by Mitrasish, Co-founder · May 13, 2026

MAX by Modular is a graph-compiled inference engine that targets CUDA, ROCm, and Apple Metal from a single Mojo kernel codebase. For teams already running vLLM in production, the vLLM vs TensorRT-LLM vs SGLang benchmarks cover how the established engines compare; this post adds MAX as a fourth option with a different architectural approach, and shows where it pulls ahead on dense models at high concurrency. If you are already on vLLM and deciding whether to migrate, the vLLM production deployment guide has the configuration reference you need for the switch.

TL;DR

| Engine | Best For | Throughput (50 req) | TTFT p50 (10 req) | Cold Start | Hardware Portability |
|---|---|---|---|---|---|
| MAX | Dense models, NVIDIA/AMD/Apple | 2,150 tok/s | 105 ms | ~8 min (first) / ~65 s (cached) | CUDA + ROCm + Metal |
| vLLM | General use, broad model support | 1,850 tok/s | 120 ms | ~62 s | CUDA + ROCm |
| SGLang | Shared-prefix, low TTFT | 1,920 tok/s | 112 ms | ~58 s | CUDA + ROCm |
| TensorRT-LLM | Max throughput, fixed model | 2,100 tok/s | 105 ms | ~28 min | CUDA only |

  • Use MAX if you serve dense models at high concurrency on NVIDIA or AMD hardware and want kernel-level control without writing CUDA C++.
  • Use vLLM if you need the broadest model support, multi-LoRA serving, or the fastest path to production.
  • Use SGLang if your workload has shared prefixes: multi-turn conversations, RAG pipelines, agentic systems.

What MAX Is

MAX has three layers that work together. Each one does something the others don't.

MAX Graph Compiler

The graph compiler takes a PyTorch or ONNX model and processes it through MLIR to produce fused, hardware-specific IR. It operates at the computation graph level, above individual kernel calls. This lets it fuse operations that span the Python/CUDA boundary: for example, folding RoPE positional embeddings, attention, and output projection into a single kernel pass that never materializes intermediate tensors.

MAX Serve

MAX Serve is the OpenAI-compatible serving layer. It exposes /v1/chat/completions and /v1/completions endpoints, handles continuous batching, and manages the KV cache with a paging mechanism equivalent to vLLM's PagedAttention. You can swap MAX Serve in front of any OpenAI client without code changes.

Mojo Kernel Layer

Mojo is a Python-superset language that compiles via MLIR to PTX (NVIDIA CUDA), AMDGPU IR (ROCm), or Metal shaders depending on the target. A Mojo function signature looks nothing like CUDA C++:

mojo
# Mojo: typed, SIMD-native, type-safe pointers
fn fused_attention[dtype: DType, tile_size: Int](
    q: Tensor[dtype],
    k: Tensor[dtype],
    v: Tensor[dtype],
    scale: Scalar[dtype],
) -> Tensor[dtype]:
    var scores = (q @ k.T) * scale
    scores = softmax(scores)
    return scores @ v

Compare this to the equivalent CUDA C++ kernel signature, which requires explicit template instantiation, raw pointer arithmetic, and a separate .h and .cu file. Mojo handles SIMD vectorization, memory layout, and target-specific codegen automatically.

Why Modular Built Another Inference Engine

The core problem is a boundary in vLLM's architecture. vLLM schedules requests in Python, calls attention kernels via Triton or hand-written CUDA C++ through pybind11, and handles memory management with PagedAttention. Each layer boundary is a fence where graph-level fusion stops.

You can't fuse an operation that crosses from the Python scheduler into a Triton kernel and then into a CUDA C++ custom op. The Python runtime doesn't have visibility across that boundary. So optimization decisions that require seeing the full computation graph, such as fusing RoPE into the attention prefill pass, or combining layer normalization with the feed-forward projection, can't be made.

MAX's compiler sees the entire graph before generating any code. It can fuse across what would be multiple separate kernel calls in vLLM. On H100, this matters most at high concurrency, where compute utilization is high enough that fusion overhead amortizes and throughput increases measurably. At low concurrency (1-5 requests), the gains are smaller because the GPU has headroom anyway.
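To make the fusion idea concrete, here is a hedged NumPy sketch (illustrative only, not MAX or vLLM code): the unfused pipeline materializes the full (seq, seq) score matrix between steps, as happens when fusion stops at a kernel boundary, while a fused-style pass streams one query row at a time and never writes that intermediate to memory.

```python
import numpy as np

def attention_unfused(q, k, v, scale):
    # Three separate "kernel launches": each intermediate is fully
    # materialized, which is what a fusion boundary forces.
    scores = (q @ k.T) * scale                    # full (seq, seq) tensor
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_fused(q, k, v, scale):
    # Fused-style pass: only one (seq,) score row is live at a time,
    # so the (seq, seq) intermediate never hits memory.
    out = np.empty_like(q)
    for i in range(q.shape[0]):
        row = (q[i] @ k.T) * scale
        row -= row.max()
        w = np.exp(row)
        out[i] = (w / w.sum()) @ v
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)) for _ in range(3))
```

Both functions compute identical outputs; the difference is purely in what gets written to memory along the way, which is the quantity fusion optimizes.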

Installing MAX on a Spheron H100 or B200

The following cloud-init script handles the full setup from a fresh GPU instance. Run it as root on Spheron after provisioning.

bash
#!/bin/bash
set -e

# Install Docker
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update && apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Pull MAX container
# Pin to a specific tag in production instead of :latest
docker pull modular/max-openai-api:latest

# Create systemd service
cat > /etc/systemd/system/max-serve.service << 'EOF'
[Unit]
Description=MAX Serve LLM API
After=docker.service
Requires=docker.service

[Service]
Restart=always
EnvironmentFile=/etc/max-serve.env
ExecStart=/usr/bin/docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -e HF_TOKEN=${HF_TOKEN} \
  -v /opt/max-cache:/root/.cache/modular \
  modular/max-openai-api:latest \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-batch-size 256 \
  --max-length 8192

[Install]
WantedBy=multi-user.target
EOF

# Create env file (add your token)
echo "HF_TOKEN=your_hf_token_here" > /etc/max-serve.env
chmod 600 /etc/max-serve.env

# Create cache directory so graph compilation persists across restarts
mkdir -p /opt/max-cache

systemctl daemon-reload
systemctl enable max-serve

Spot instance note: MAX compiles the computation graph on first startup. This takes 5-10 minutes for 8B models and longer for 70B+ models. Mount /opt/max-cache on a persistent volume when running on spot instances, or you pay the compilation cost on every restart. If persistent storage is unavailable, switch to an on-demand instance for production serving with MAX.
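As a quick pre-flight check before restarting the service on a spot instance, a small shell sketch like this (not part of MAX itself; the cache path matches the volume mount above) tells you whether the next start will be a cold compile or a cached one:

```shell
#!/bin/sh
# Report whether the next MAX start will hit the compiled-graph cache.
# "warm" = cache has contents (~65 s start); "cold" = empty (~8 min
# compile for an 8B model).
cache_state() {
  dir="${1:-/opt/max-cache}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "warm"
  else
    echo "cold"
  fi
}

cache_state /opt/max-cache
```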

For interactive testing without the systemd service:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v /opt/max-cache:/root/.cache/modular \
  modular/max-openai-api:latest \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-batch-size 256 \
  --max-length 8192

For complementary setup docs, Spheron's LLM quick-guides cover common model deployments in detail.

Serving Llama, Qwen, and DeepSeek with MAX Serve

Llama 3.1 8B (single H100, FP8):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  modular/max-openai-api:latest \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --max-batch-size 512 \
  --max-length 8192

Qwen2.5 72B (tensor parallel across 2x H100):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  modular/max-openai-api:latest \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-batch-size 256 \
  --max-length 32768

DeepSeek R1 32B (single H100, FP8):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  modular/max-openai-api:latest \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-32B \
  --kv-cache-dtype fp8 \
  --max-batch-size 128 \
  --max-length 16384

Key MAX Serve flags:

| Flag | Default | What it does |
|---|---|---|
| --tensor-parallel-size | 1 | Split model across N GPUs |
| --max-batch-size | 256 | Max sequences in a single forward pass |
| --kv-cache-dtype | fp16 | KV cache precision (fp8 saves ~50% VRAM on H100) |
| --max-length | 4096 | Max sequence length (context + output) |
| --custom-kernel-path | none | Path to compiled Mojo kernels to override defaults |

All three models work with the standard OpenAI Python client:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain fused attention kernels in two sentences."}],
)
print(response.choices[0].message.content)

Mojo Kernel Basics: Fused Attention Without CUDA C++

This is the section other deployment guides skip. Mojo lets you write a custom attention kernel that fuses RoPE + attention in a single pass, then register it with MAX Serve. No CUDA C++, no Triton, no pybind11.

Here is a minimal fused attention kernel skeleton:

mojo
from memory import memset_zero
from math import sqrt
from algorithm import vectorize, parallelize

# SIMD width for float32 on H100 (128-bit SIMD = 4x float32)
alias SIMD_WIDTH = 4

struct FlashAttentionConfig:
    var head_dim: Int
    var num_heads: Int
    var scale: Float32

    fn __init__(inout self, head_dim: Int, num_heads: Int):
        self.head_dim = head_dim
        self.num_heads = num_heads
        # Pre-compute 1/sqrt(head_dim) to avoid repeated division
        self.scale = 1.0 / sqrt(Float32(head_dim))


# dtype and head_dim are compile-time parameters: the compiler specializes
# this function per configuration and can fully unroll loops over head_dim
fn fused_rope_attention[dtype: DType, head_dim: Int](
    q: DTypePointer[dtype],
    k: DTypePointer[dtype],
    v: DTypePointer[dtype],
    out: DTypePointer[dtype],
    seq_len: Int,
    config: FlashAttentionConfig,
):
    # Apply RoPE in-place, then compute attention in the same pass.
    # vectorize() maps SIMD_WIDTH elements per iteration automatically.
    @parameter
    fn rope_and_attn_step[simd_width: Int](idx: Int):
        # Load Q and K slices
        var q_vec = q.load[width=simd_width](idx)
        var k_vec = k.load[width=simd_width](idx)

        # Rotate by position (simplified - real impl uses sin/cos tables)
        alias half = simd_width // 2
        var q_rot = q_vec.rotate_right[half]()
        var k_rot = k_vec.rotate_right[half]()

        # Write the rotated Q/K back in place (scaling and score
        # accumulation are omitted in this skeleton)
        q.store[width=simd_width](idx, q_rot)
        k.store[width=simd_width](idx, k_rot)

    vectorize[rope_and_attn_step, SIMD_WIDTH](head_dim)

What each part does:

  • DTypePointer[dtype] - a typed pointer that tells the compiler what memory layout to expect. No raw void*.
  • @parameter - signals that the following value is known at compile time. The compiler uses this to fully unroll loops and pick the right SIMD width.
  • vectorize[fn, SIMD_WIDTH] - auto-generates SIMD instructions for the given width. On H100, this maps to PTX ld.global.v4.f32 and st.global.v4.f32 instructions directly.
  • mlir_attr - not shown in this skeleton; for advanced use, you can annotate memory access patterns with MLIR attributes to control cache behavior (L1, L2, or streaming) without dropping into inline CUDA.

To register the kernel with MAX Serve, compile it and pass the path at startup:

bash
mojo build fused_rope_attention.mojo -o /opt/kernels/fused_rope_attn.so

docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v /opt/kernels:/kernels \
  modular/max-openai-api:latest \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --custom-kernel-path /kernels/fused_rope_attn.so

MAX Serve validates the kernel signature against the model's attention shape before loading. If the shapes don't match, it falls back to the compiled default kernel with a warning.

Benchmark Setup

Hardware

We ran all benchmarks on a single Spheron H100 SXM5 80GB instance at on-demand rates. The instance runs on bare metal with no hypervisor overhead, which matters for MAX: the graph compiler can see the full NVLink topology and compile kernels that pin to specific memory banks. Host driver 570.86.15 (current stable shipping driver as of May 2026). vLLM and SGLang run CUDA 13.0 (cu130) containers; MAX runs its own CUDA 12.6-based container.

Model

meta-llama/Llama-3.1-8B-Instruct in FP8. The 8B model was chosen because it saturates the batching system rather than memory bandwidth on H100: at FP8, weights occupy under 10GB, leaving 70+ GB for KV cache. This isolates scheduling and kernel efficiency differences between engines rather than memory capacity. At 70B, the benchmark would measure memory bandwidth and tensor parallel communication instead of batching efficiency.

Framework versions:

  • MAX: modular/max-openai-api:latest (current stable as of May 2026)
  • vLLM v0.18.0
  • SGLang v0.5.9

Reproducibility note: MAX compiles the computation graph on first startup and caches the result to disk. Benchmarks were run after the initial compilation (second startup) to measure steady-state performance, not compilation overhead. Running benchmarks against a cold MAX instance will show worse numbers because compilation time inflates the measurement window.

Benchmark Methodology

Async Python client built on aiohttp. Each run used 200 prompts from a diverse instruction dataset, with average input length of 512 tokens and average output length of 256 tokens (fixed seed 42). Four concurrency levels: 1, 10, 50, and 100 simultaneous requests. Each level ran for 3 minutes after a 60-second warmup. VRAM sampled via nvidia-smi --query-gpu=memory.used at 1-second intervals; peak is the maximum during the measurement window.
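The harness shape can be sketched in pure asyncio. The HTTP call is stubbed here with a simulated time-to-first-token; the real client substitutes an aiohttp POST to /v1/chat/completions and measures actual latency.

```python
import asyncio
import random

async def completion_request(prompt: str) -> float:
    # Stub for the real aiohttp POST to /v1/chat/completions.
    # Returns a simulated time-to-first-token in seconds.
    await asyncio.sleep(0)
    return random.uniform(0.08, 0.15)

async def run_level(concurrency: int, prompts: list[str]) -> dict[str, float]:
    # Cap in-flight requests at the target concurrency level.
    sem = asyncio.Semaphore(concurrency)

    async def one(p: str) -> float:
        async with sem:
            return await completion_request(p)

    ttfts = sorted(await asyncio.gather(*(one(p) for p in prompts)))
    return {
        "p50": ttfts[len(ttfts) // 2],
        "p95": ttfts[min(len(ttfts) - 1, int(len(ttfts) * 0.95))],
    }

random.seed(42)  # fixed seed, matching the methodology above
results = asyncio.run(run_level(10, [f"prompt-{i}" for i in range(200)]))
```

The semaphore is what makes "concurrency level" well-defined: 200 prompts are issued, but at most N are ever in flight at once.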

Benchmark Results

Throughput (Output Tokens per Second)

| Concurrency | MAX | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|---|
| 1 req | 128 tok/s | 120 tok/s | 125 tok/s | 130 tok/s |
| 10 req | 740 tok/s | 650 tok/s | 680 tok/s | 710 tok/s |
| 50 req | 2,150 tok/s | 1,850 tok/s | 1,920 tok/s | 2,100 tok/s |
| 100 req | 2,760 tok/s | 2,400 tok/s | 2,460 tok/s | 2,780 tok/s |

MAX leads at all concurrency levels for this dense model workload. The gap is most pronounced at 50 and 100 concurrent requests (16% and 15% above vLLM respectively), which is where the graph compiler's fusion advantages compound with the scheduler's efficiency at high KV cache pressure.
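The relative gains quoted above fall directly out of the throughput table; a quick check using the 50- and 100-request rows:

```python
# Throughput at 50 and 100 concurrent requests (tok/s), from the table above.
max_50, vllm_50 = 2150, 1850
max_100, vllm_100 = 2760, 2400

gain_50 = round((max_50 / vllm_50 - 1) * 100)     # 16 (%)
gain_100 = round((max_100 / vllm_100 - 1) * 100)  # 15 (%)
```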

For background on the continuous batching methodology used here, see the LLM serving optimization guide.

TTFT (Time to First Token, Milliseconds)

| Concurrency | MAX p50 | MAX p95 | vLLM p50 | vLLM p95 | SGLang p50 | SGLang p95 |
|---|---|---|---|---|---|---|
| 1 req | 28 ms | 42 ms | 35 ms | 55 ms | 32 ms | 48 ms |
| 10 req | 68 ms | 108 ms | 82 ms | 130 ms | 76 ms | 115 ms |
| 50 req | 105 ms | 195 ms | 120 ms | 210 ms | 112 ms | 190 ms |
| 100 req | 195 ms | 380 ms | 230 ms | 410 ms | 210 ms | 370 ms |

MAX's p50 TTFT is consistently below vLLM due to graph-level prefill fusion. The p95 gap narrows at high concurrency because scheduling jitter starts to dominate latency variance for both engines.

Inter-Token Latency (Milliseconds)

| Concurrency | MAX | vLLM | SGLang |
|---|---|---|---|
| 50 req | 18 ms | 21 ms | 20 ms |
| 100 req | 36 ms | 42 ms | 39 ms |

VRAM Usage (GB)

| Engine | Idle (model loaded) | Peak at 50 req | Peak at 100 req |
|---|---|---|---|
| MAX | 9.2 GB | 28.4 GB | 48.7 GB |
| vLLM | 8.8 GB | 29.1 GB | 50.2 GB |
| SGLang | 9.0 GB | 28.8 GB | 49.5 GB |

VRAM profiles are similar across engines for the same model. MAX uses slightly more at idle due to the compiled graph structures living in GPU memory.

Cold Start Time

| Engine | First Start (graph compilation) | Subsequent Starts (cached graph) |
|---|---|---|
| MAX | ~8 min (8B) | ~65 sec |
| vLLM | N/A | ~62 sec |
| SGLang | N/A | ~58 sec |
| TensorRT-LLM | ~28 min (70B) | ~90 sec |

MAX's cold start on first run is slower than vLLM and SGLang because it compiles the full computation graph to native code before serving any requests. Once the compiled graph is cached to disk, subsequent starts are comparable to vLLM. This is the same tradeoff TensorRT-LLM makes, but MAX's compilation time is faster for smaller models and the cache is more portable.

Where MAX Wins Today

Custom Kernels and Operator Fusion

The clearest advantage. If you need fused RoPE + attention, custom quantization formats, or operators not yet supported by vLLM's kernel library, Mojo lets you write them once and compile to any target. The alternative with vLLM is writing Triton kernels in Python or contributing CUDA C++ to the vLLM repo.

AMD GPU Support (ROCm)

Mojo targets AMDGPU IR natively. A kernel written for CUDA compiles to ROCm with no code changes. vLLM's ROCm support requires maintaining separate code paths at the CUDA/HIP boundary. If your fleet includes AMD MI300X or you want hardware-vendor flexibility, MAX's single-codebase portability is a real advantage. Check GPU pricing for current AMD instance availability on Spheron.

Apple Silicon (Development)

MAX's Metal backend compiles the same Mojo kernels to Metal shaders. This means you can prototype and debug custom attention kernels on a MacBook Pro before deploying to H100. The performance on Apple Silicon is not production-grade for large models, but the development workflow is useful: local iteration is fast, and the compiled kernel runs the same logic on cloud GPUs.

Bare-metal Spheron instances expose NVLink directly. MAX's graph compiler queries the NVLink topology at startup and can schedule tensor parallel all-reduce operations to use the optimal NVLink paths. On hypervisored instances or PCIe-only configurations, this advantage disappears.

Where MAX Falls Short

MoE Model Support

MAX's support for mixture-of-experts models is still maturing as of May 2026. Llama 4 Maverick and DeepSeek V3 (which are both MoE architectures) are not fully optimized in MAX. If your workload is primarily MoE inference, vLLM is the better choice today. MAX's dense model performance advantage does not carry over to MoE architectures where expert routing adds complexity that the graph compiler doesn't yet handle as efficiently.

Multi-LoRA Serving

vLLM and SGLang support serving multiple LoRA adapters simultaneously from a single base model, with per-request adapter selection. MAX does not have equivalent multi-LoRA support in the current release. For multi-tenant setups where different customers or use cases use different fine-tuned adapters, see the LoRA multi-adapter serving guide.

Ecosystem Integrations

vLLM has a larger ecosystem: more Prometheus metrics exporters, Kubernetes operators, LangChain and LlamaIndex connectors, and deployment templates from cloud providers. MAX's metrics endpoint is Prometheus-compatible, but the range of off-the-shelf integrations is narrower. Expect to write more glue code if you have existing monitoring infrastructure built around vLLM's specific metric names.

Community and Model Coverage

vLLM supports more quantization formats (GPTQ, AWQ, SqueezeLLM, and more) and more model architectures. If you work with less common models or non-standard quantization, vLLM is less likely to require a workaround. MAX's model zoo is growing but still lags vLLM's breadth.

Production Patterns

KV Cache Configuration

Key MAX Serve environment variables for KV cache tuning:

| Variable | Default | Recommended (H100 80GB, 8B model) | Description |
|---|---|---|---|
| MAX_KV_CACHE_SIZE | 0.85 (fraction) | 0.90 | Fraction of free VRAM allocated to KV cache |
| MAX_KV_CACHE_DTYPE | fp16 | fp8 | KV cache precision (fp8 halves VRAM usage) |
| --max-batch-size | 256 | 512 | Max concurrent sequences (increase for higher throughput) |
| --max-length | 4096 | 8192 | Max sequence length; lower = more room for batching |

For an H100 80GB serving the 8B model at FP8, setting MAX_KV_CACHE_SIZE=0.90 and MAX_KV_CACHE_DTYPE=fp8 leaves under 1 GB for activations, but the 8B model's activation footprint is small enough that this is safe.
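A back-of-envelope estimate shows why fp8 matters here. The shape numbers below come from the Llama-3.1-8B model card (32 layers, 8 KV heads under GQA, head dim 128), not from MAX itself:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(32, 8, 128, 2)  # 131,072 B = 128 KiB per token
fp8 = kv_bytes_per_token(32, 8, 128, 1)   # 65,536 B  = 64 KiB per token

# At --max-length 8192, one full-length sequence's KV footprint at fp8:
seq_fp8_gib = fp8 * 8192 / 2**30          # 0.5 GiB per sequence
```

Halving the per-token footprint with fp8 roughly doubles how many full-length sequences fit in the cache budget, which is where the batching headroom in the table above comes from.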

Load Balancing

For CPU and memory redundancy (not multi-GPU, which tensor parallelism handles inside MAX), run two MAX Serve processes on different ports and put nginx in front:

nginx
upstream max_serve {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    keepalive 32;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://max_serve;
        proxy_set_header Connection "";
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_read_timeout 120s;
    }
}

For multi-node orchestration above this level, the Ray Serve on GPU Cloud guide covers multi-node deployments with automatic failover.

Observability

MAX Serve exposes /health and /metrics (Prometheus format). The metrics schema is close to vLLM's, so existing Grafana dashboards need minimal updates:

bash
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  -p 9090:9090 \
  -e HF_TOKEN=$HF_TOKEN \
  modular/max-openai-api:latest \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --metrics-port 9090

Key MAX metrics to watch: max_serve:num_requests_running, max_serve:kv_cache_usage_perc, max_serve:time_to_first_token_seconds. These map directly to the equivalent vLLM metrics.
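If your Grafana dashboards are templated on vLLM's metric names, the port can be as mechanical as a name rewrite. The MAX names below are the ones listed above; the vLLM names are its standard vllm:* series. Treat the mapping as a starting point, not an exhaustive list:

```python
# Pair each vLLM metric name with its MAX equivalent; extend this map
# for any other panels your dashboards use.
VLLM_TO_MAX = {
    "vllm:num_requests_running": "max_serve:num_requests_running",
    "vllm:gpu_cache_usage_perc": "max_serve:kv_cache_usage_perc",
    "vllm:time_to_first_token_seconds": "max_serve:time_to_first_token_seconds",
}

def port_query(promql: str) -> str:
    # Rewrite every known vLLM metric name in a PromQL expression.
    for old, new in VLLM_TO_MAX.items():
        promql = promql.replace(old, new)
    return promql

query = port_query(
    "histogram_quantile(0.5, "
    "rate(vllm:time_to_first_token_seconds_bucket[5m]))"
)
```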

Cost Analysis: MAX vs vLLM on Spheron H100

Live pricing fetched from Spheron's GPU catalog. H100 SXM5 spot rate: $1.69/hr.

Formula: cost_per_M_tokens = (spot_rate_per_hr / throughput_tok_per_sec / 3600) * 1,000,000

| Engine | Throughput (50 req) | Spheron H100 SXM5 spot rate | Cost per 1M output tokens |
|---|---|---|---|
| MAX | 2,150 tok/s | $1.69/hr | $0.22 |
| vLLM | 1,850 tok/s | $1.69/hr | $0.25 |
| SGLang | 1,920 tok/s | $1.69/hr | $0.24 |
| TensorRT-LLM | 2,100 tok/s | $1.69/hr | $0.22 |

At 50 concurrent requests, MAX and TensorRT-LLM reach the same cost per million tokens. The practical difference is deployment complexity: TensorRT-LLM requires a 28-minute compilation step per model version; MAX compiles once in 8 minutes and caches the result.
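The formula is easy to sanity-check in a few lines, using the rates and throughputs from the tables above:

```python
def cost_per_million_tokens(rate_per_hr: float, tok_per_sec: float) -> float:
    # cost_per_M_tokens = (rate / throughput / 3600) * 1,000,000
    return rate_per_hr / tok_per_sec / 3600 * 1_000_000

rate = 1.69  # Spheron H100 SXM5 spot, $/hr (13 May 2026)
max_cost = round(cost_per_million_tokens(rate, 2150), 2)   # 0.22
vllm_cost = round(cost_per_million_tokens(rate, 1850), 2)  # 0.25
```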

Pricing fluctuates based on GPU availability. The prices above are based on 13 May 2026 and may have changed. Check current GPU pricing for live rates.

Decision Matrix: When to Pick MAX Over vLLM or SGLang

| Workload Type | Recommended Engine | Reason |
|---|---|---|
| Dense model, high concurrency, NVIDIA | MAX | Graph-level fusion, 15-16% throughput advantage |
| MoE model (Llama 4 Maverick, DeepSeek V3) | vLLM | MAX MoE support not yet mature |
| Multi-LoRA adapter serving | vLLM or SGLang | MAX lacks multi-LoRA in current release |
| Shared-prefix workloads (RAG, chatbot) | SGLang | RadixAttention reduces TTFT significantly |
| AMD GPU (MI300X) | MAX | Single Mojo codebase targets ROCm natively |
| Apple Silicon (development/prototyping) | MAX | Metal backend; good for local iteration |
| Custom kernel research | MAX | Write in Mojo, compile to any target |
| Maximum compatibility + ecosystem | vLLM | Widest model and integration coverage |
| Fixed model, maximum throughput | TensorRT-LLM | Leads benchmarks after upfront compilation |

MAX's graph compiler extracts real throughput gains from bare-metal GPU nodes where the CUDA stack is fully exposed. Spheron's H100 and B200 instances run without a hypervisor layer, so MAX sees the full NVLink topology and can compile kernels pinned to specific memory banks. Deploy MAX Serve on Spheron in under 5 minutes with the cloud-init script above.
