Most GPU inference guides assume transformers. That assumption is starting to cost teams money.
State space models (SSMs) like Mamba-3, released by researchers from CMU, Princeton, Cartesia AI, and Together AI in March 2026, offer a different architectural tradeoff: linear-time inference whose per-token cost stays flat as context length grows, instead of climbing the way attention does. For workloads with long context windows, the GPU economics flip: a transformer that is the cheaper option at 2K context can cost severalfold more per token than the SSM equivalent at 64K.
This guide covers how to size GPUs for Mamba-3 and SSM workloads, how to deploy with vLLM and SGLang, and when to actually use SSMs instead of a transformer. We also compare cost per token at long context lengths to give you a concrete decision framework. For background on the memory bandwidth bottleneck in LLM inference, see the AI memory wall inference guide.
What Are State Space Models and Why They Challenge Transformers in 2026
Transformers have a fundamental scaling problem at long context: the KV cache. Every new token requires attending to all previous tokens, so attention cost grows quadratically with sequence length, and a 7B transformer at 128K context spends orders of magnitude more compute per sequence than at 2K. VRAM use grows linearly with sequence length via the KV cache, throughput collapses, and you end up buying more GPU just to handle longer documents.
SSMs work differently. Instead of storing every past token in a KV cache, they compress the past into a fixed-size recurrent state. The state size does not grow with sequence length. Whether you process 1K tokens or 128K tokens, the memory overhead of the recurrent state stays constant at 2-4 GB, compared to tens of gigabytes for a transformer KV cache at the same length.
This is not a new idea. RNNs and LSTMs used the same recurrent-state approach. What Mamba and its successors got right was selectivity: the state learns which information to remember and which to discard, based on the input. That selective scan mechanism (S6 in the original Mamba, building on the earlier non-selective S4 line of models, with a multi-head variant in Mamba-3) is what gives SSMs competitive quality against transformers.
In 2026, SSMs moved from research curiosity to practical production option. The Mamba-3 release from CMU, Princeton, Cartesia AI, and Together AI in March 2026, along with AI21 Labs' Jamba hybrids and other SSM variants, gave teams access to models that handle 64K-128K contexts at a fraction of the GPU cost of equivalent transformers. For background on what KV cache pressure looks like in practice, see the KV cache optimization guide.
Mamba-3 Architecture: How Linear-Time Inference Changes GPU Economics
At the architecture level, Mamba-3 replaces the multi-head attention (MHA) and KV cache mechanism with a selective state space layer. Each layer maintains a state matrix that represents compressed information from all prior tokens. On each new token, the model:
- Computes how much to update the state (the selective gate)
- Reads from the state to generate output
- Writes a new compressed representation back to the state
The state size is fixed, regardless of sequence length. That's the key property.
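The three steps above can be sketched in a few lines of NumPy. This is a toy, single-group version of a selective state update that illustrates only the fixed-size-state property; the projection names (`W_dt`, `W_B`, `W_C`), shapes, and update rule are illustrative simplifications, not Mamba-3's actual parameterization or fused CUDA kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 16                           # fixed state size: does NOT grow with seq_len
d_model = 8
A = -np.exp(rng.normal(size=d_state))  # stable (negative) per-channel decay rates

def ssm_step(h, x_t, W_dt, W_B, W_C):
    """Process one token against a fixed-size state h: gate, write, read."""
    dt = np.log1p(np.exp(W_dt @ x_t))      # selective gate: input sets update size
    A_bar = np.exp(dt * A)                 # discretized state decay
    B_t = W_B @ x_t                        # input-dependent write vector
    C_t = W_C @ x_t                        # input-dependent read vector
    h = A_bar * h + dt * B_t * x_t.mean()  # write compressed info into the state
    y_t = C_t @ h                          # read the output from the state
    return h, y_t

W_dt, W_B, W_C = (rng.normal(size=(d_state, d_model)) * 0.1 for _ in range(3))
h = np.zeros(d_state)
for _ in range(1000):                      # 1K or 128K tokens: h stays d_state-sized
    h, y = ssm_step(h, rng.normal(size=d_model), W_dt, W_B, W_C)
assert h.shape == (d_state,)
```

However many tokens the loop consumes, `h` never grows; that constant-size `h` is what replaces the transformer's per-token KV cache entries.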
Mamba-3 specifically introduced a MIMO (multi-input, multi-output) SSM design over Mamba-2, improving how information is distributed across the state matrix. It also improved hardware utilization with better CUDA kernel alignment to the tensor core requirements of H100 and A100 GPUs. In practice this means Mamba-3 runs faster per token than Mamba-2 on the same hardware.
For GPU economics, the critical implication is the bottleneck shift. Transformers at long context are memory-bandwidth-bound: the GPU spends most of its time moving KV cache data between HBM and compute units. SSMs at long context are more compute-bound: the state update operation is a matrix multiplication over a fixed-size matrix, which plays to the GPU's FLOP throughput strengths.
H100 SXM5 has ~1.98 petaFLOPS of BF16 Tensor Core compute (with structured sparsity; ~989 TFLOPS dense) and 3.9 petaFLOPS with FP8 Tensor Cores (with structured sparsity; ~1,979 TFLOPS dense), but 3.35 TB/s of HBM bandwidth. For memory-bandwidth-bound workloads (long-context transformers), H200's 4.8 TB/s bandwidth gives a real throughput advantage worth the price premium. For SSMs, which are more compute-bound at long context, H100's price-to-performance ratio improves since H200's bandwidth premium goes unused. This changes which GPU tier makes sense for your workload. For a full treatment of the memory vs compute tradeoff, see the inference engineering guide.
Note on model sizes: Published Mamba-3 research covers models at approximately 1.5B parameters. The 7B, 34B, and 70B sizes used in this guide are illustrative projections based on SSM scaling behavior, not confirmed released variants. Before deploying, check the state-spaces HuggingFace organization for currently available model IDs and sizes.
GPU and VRAM Requirements for Mamba-3 vs Equivalent Transformer Models
SSMs need less VRAM than transformers at moderate and long context lengths, and the gap widens as context grows. The formula is:

```
vram_gb = (params_billions x bytes_per_dtype x 1.07) + state_overhead_gb
```

The 1.07 overhead factor accounts for activations and runtime buffers. SSMs use a smaller overhead factor than transformers (which use ~1.15) because they have no KV cache to reserve headroom for. For SSMs, state_overhead_gb is constant at approximately 2-4 GB regardless of sequence length. For transformers, the KV cache adds:

```
kv_cache_gb = 2 x num_layers x num_kv_heads x head_dim x seq_len x batch_size x bytes_per_dtype / 1e9
```

For a 7B transformer at BF16, 32K context, batch size 4: roughly 2 x 32 x 8 x 128 x 32768 x 4 x 2 / 1e9 = ~17 GB of KV cache on top of the ~16 GB of weights, or ~33 GB total. The same 7B SSM at 32K context needs ~16 GB weights + ~3 GB state = ~19 GB total. For more detail on transformer VRAM math, see the GPU memory requirements guide.
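The two formulas can be wrapped in a small sizing helper. The transformer defaults below (32 layers, 8 KV heads, head dim 128) match the 7B worked example above and are assumptions to adjust for your model; the outputs land within a gigabyte of the rounded figures in the prose.

```python
def ssm_vram_gb(params_b, bytes_per_dtype=2, state_overhead_gb=3.0):
    # Weights x 1.07 (activations/buffers) + fixed recurrent state.
    return params_b * bytes_per_dtype * 1.07 + state_overhead_gb

def transformer_vram_gb(params_b, seq_len, batch_size, bytes_per_dtype=2,
                        num_layers=32, num_kv_heads=8, head_dim=128):
    # Weights x 1.15 + KV cache, which grows linearly with seq_len and batch.
    weights = params_b * bytes_per_dtype * 1.15
    kv = (2 * num_layers * num_kv_heads * head_dim
          * seq_len * batch_size * bytes_per_dtype) / 1e9
    return weights + kv

# 7B at BF16, 32K context, batch 4 -- the worked example above.
print(round(ssm_vram_gb(7), 1))                    # 18.0 (GB, constant in seq_len)
print(round(transformer_vram_gb(7, 32768, 4), 1))  # 33.3 (GB, and still growing)
```

Doubling `seq_len` leaves the SSM number unchanged and adds another ~17 GB to the transformer number, which is the whole story of this section in two function calls.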
Here's how the major model sizes compare across precisions:
| Model | Params | Precision | VRAM (Weights + State/KV) | Minimum GPU | Context Limit |
|---|---|---|---|---|---|
| Mamba-3 7B | 7B | BF16 | ~19 GB | L40S 48GB | 128K+ (fixed state) |
| Mamba-3 7B | 7B | FP8 | ~11 GB | L40S 48GB | 128K+ (fixed state) |
| Transformer 7B | 7B | BF16 | ~16 GB + KV cache | L40S 48GB | ~32K before pressure |
| Transformer 7B | 7B | FP8 | ~9 GB + KV cache | L40S 48GB | ~32K before pressure |
| Mamba-3 34B | 34B | BF16 | ~76 GB | A100 SXM4 80GB | 128K+ (fixed state) |
| Mamba-3 34B | 34B | FP8 | ~39 GB | A100 SXM4 80GB | 128K+ (fixed state) |
| Transformer 34B | 34B | BF16 | ~78 GB + KV cache | H100 SXM5 80GB | ~16K before pressure |
| Mamba-3 large (70B+) | 70B | FP8 | ~78 GB | H100 SXM5 80GB | 128K+ (fixed state) |
The "Context Limit" column tells the real story. Mamba-3 models have no effective context limit from VRAM growth. Transformers start hitting KV cache pressure at 16K-32K context on 80 GB GPUs, and need NVMe offloading or KV eviction to go longer. See the NVMe KV cache offloading guide for what that approach requires. SSMs skip that problem entirely.
GPU tier recommendations for Mamba-3 workloads:
| Use Case | Recommended GPU | On-Demand Price | Spot Price |
|---|---|---|---|
| Mamba-3 7B, up to 128K context | L40S PCIe | $0.72/hr | N/A |
| Mamba-3 13-34B FP8 | A100 SXM4 80GB | $1.64/hr | $0.45/hr |
| Mamba-3 34B BF16 or 70B FP8 | H100 SXM5 80GB | $2.90/hr | $0.80/hr |
| Multi-instance or batch throughput | H100 SXM5, multi-GPU | $2.90/hr per GPU | $0.80/hr per GPU |
Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Deploying Mamba-3 on GPU Cloud with vLLM and SGLang
Prerequisites
Before deploying, you need:
- A Spheron GPU instance (provision at app.spheron.ai)
- Ubuntu 22.04 with NVIDIA drivers 535+
- CUDA 12.x
- Python 3.10+
- Packages: `vllm>=0.5.0`, `mamba-ssm`, `causal-conv1d`
The mamba-ssm and causal-conv1d packages provide the custom CUDA kernels that implement the selective state space computation. Without them, vLLM falls back to a slower reference implementation. Both require CUDA to compile, so the NVIDIA toolkit must be installed before pip install.
Single-GPU Deployment with vLLM
Docker (recommended):
```bash
docker run --gpus all \
  --ipc=host \
  --rm \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Bare metal:
```bash
pip install "vllm>=0.5.0" mamba-ssm causal-conv1d
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

For FP8 quantization (halves VRAM, small quality tradeoff):
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --quantization fp8 \
  --max-model-len 131072 \
  --port 8000
```

Note: vLLM enables FP8 via the `--quantization fp8` flag (`--dtype` does not accept fp8). Also verify the exact HuggingFace model path before running. The state-spaces HuggingFace organization hosts the original Mamba series. Check there for the current Mamba model IDs available for deployment.
Multi-GPU Tensor Parallelism
SSMs have a different tensor parallelism profile than transformers. Transformers split KV cache across GPUs in addition to weights, so tensor parallelism helps with both VRAM and memory bandwidth. SSMs only split weights, because the state fits in a single GPU's VRAM at any context length.
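A rough per-GPU sizing sketch for SSM tensor parallelism follows from that difference. This assumes the recurrent state and its buffers are budgeted on each GPU rather than sharded, which is a simplifying assumption; actual per-GPU usage depends on the serving framework.

```python
def ssm_tp_vram_per_gpu_gb(params_b, tp, bytes_per_dtype=2, state_overhead_gb=3.0):
    # Weights shard evenly across the tensor-parallel group; the fixed
    # recurrent state is small, so we conservatively budget it per GPU.
    weights_per_gpu = params_b * bytes_per_dtype * 1.07 / tp
    return weights_per_gpu + state_overhead_gb

# Mamba-3 34B BF16: ~76 GB on one GPU, ~39 GB per GPU at TP=2
print(round(ssm_tp_vram_per_gpu_gb(34, 1)))  # 76
print(round(ssm_tp_vram_per_gpu_gb(34, 2)))  # 39
```

This is why 34B BF16 fits comfortably on 2x A100 80GB: tensor parallelism only needs to split weights, with no growing KV cache share to plan for.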
For Mamba-3 34B BF16 on 2x A100 80GB:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-34b-model-id> \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```

For 4x GPU:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-34b-model-id> \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --port 8000
```

On Spheron, SXM-variant GPUs use NVLink (900 GB/s bidirectional), which makes tensor parallelism efficient. PCIe Gen5-connected GPUs have up to 128 GB/s of bidirectional bandwidth (Gen4 GPUs like the A100 PCIe and L40S have 64 GB/s), so they see higher inter-GPU communication overhead. For SSMs serving throughput-heavy workloads, prefer SXM variants when running more than 2-GPU tensor parallelism. For a full vLLM production setup, see the vLLM production deployment guide.
SGLang Alternative
SGLang supports Mamba-series models with similar commands. SGLang's runtime scheduler has different batching behavior that can be beneficial for SSMs at mixed batch sizes:
```bash
pip install "sglang[all]" mamba-ssm causal-conv1d
python -m sglang.launch_server \
  --model-path <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --context-length 131072 \
  --port 8000
```

For a full SGLang production setup including load balancing and monitoring, see the SGLang production deployment guide.
Long Context Configuration
SSMs can run at 64K-128K+ context lengths without any special configuration because their memory use does not grow with sequence length. This is a direct contrast to transformer deployments, where you need KV cache eviction, CPU offloading, or NVMe offloading to reach those lengths on most GPUs.
To configure Mamba-3 for maximum context:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

You can set `--gpu-memory-utilization` higher for SSMs than you would for transformers, because there is no KV cache that grows unboundedly. The main VRAM consumers are the model weights plus a small fixed state buffer.
Benchmark Comparison: SSM vs Transformer Throughput, Latency, and Cost Per Token
These benchmarks are representative estimates based on published SSM research and scaling behavior. Run your own benchmarks on your target hardware before production decisions.
Table 1: Throughput at varying context lengths (tokens/second, Mamba-3 7B vs Llama-3 7B, single H100 SXM5)
| Context Length | Mamba-3 7B (H100) | Llama-3 7B (H100) | SSM Advantage |
|---|---|---|---|
| 2K tokens | ~2,800 tok/s | ~3,100 tok/s | ~0.9x (transformer faster) |
| 8K tokens | ~2,750 tok/s | ~2,400 tok/s | ~1.1x |
| 16K tokens | ~2,700 tok/s | ~1,100 tok/s | ~2.5x |
| 64K tokens | ~2,600 tok/s | ~350 tok/s | ~7x |
| 128K tokens | ~2,500 tok/s | ~120 tok/s | ~20x |
The crossover is around 8K tokens. Below that, transformers are competitive or faster. Above 16K, SSMs pull ahead and the gap compounds at longer lengths. For workloads where the average sequence is under 4K tokens, a transformer is likely the right choice.
Table 2: Cost per million tokens on Spheron GPUs
| Model | GPU | Price/hr | Throughput (8K avg) | Cost/M tokens |
|---|---|---|---|---|
| Mamba-3 7B BF16 | L40S PCIe (on-demand) | $0.72/hr | ~1,800 tok/s | ~$0.11/M |
| Mamba-3 34B FP8 | A100 SXM4 (on-demand) | $1.64/hr | ~800 tok/s | ~$0.57/M |
| Llama-3 7B BF16 | L40S PCIe (on-demand) | $0.72/hr | ~1,200 tok/s (at 8K) | ~$0.17/M |
| Llama-3 34B FP8 | A100 SXM4 (on-demand) | $1.64/hr | ~400 tok/s (at 8K) | ~$1.14/M |
At 8K context, Mamba-3 7B on-demand is roughly 1.5x cheaper per token than an equivalent transformer at the same GPU tier. At 64K context, the gap is closer to ~7x.
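The cost-per-token arithmetic behind the table is simple enough to reuse with your own measurements. The throughput figures below come from the table above and are representative estimates, not benchmarks:

```python
def cost_per_million_tokens(price_per_hr, tokens_per_sec):
    # tokens per hour = tok/s * 3600; scale the per-token cost to 1M tokens.
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

# L40S on-demand at $0.72/hr, 8K-context throughput estimates from the table
print(round(cost_per_million_tokens(0.72, 1800), 2))  # 0.11  Mamba-3 7B
print(round(cost_per_million_tokens(0.72, 1200), 2))  # 0.17  Llama-3 7B
```

Because GPU-hour price is fixed, cost per token is just the inverse of sustained throughput, which is why the SSM advantage in Table 1 translates directly into the cost gap here.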
Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For a full look at inference cost optimization techniques, see the AI inference cost economics guide.
When to Use SSMs vs Transformers: Decision Framework for Production Workloads
The core tradeoff is simple: SSMs win at long context and lose at ecosystem maturity. Transformers win at short context and have dramatically better tooling.
| Criterion | Use SSM (Mamba-3) | Use Transformer |
|---|---|---|
| Typical context length | Over 16K tokens | Under 4K tokens |
| VRAM budget | Constrained (under 80 GB) | Flexible |
| Primary workload | Document analysis, summarization | Short-form generation, chat |
| Throughput priority | Long-document throughput | Concurrent short requests |
| Ecosystem requirement | Flexible | Strict (fine-tuning tooling, adapters) |
| Instruction following | Simple tasks | Complex multi-step reasoning |
| Fine-tuning | Limited options | Full ecosystem (LoRA, PEFT, Axolotl) |
| Serving framework | vLLM 0.5+, SGLang (limited) | Full ecosystem |
The biggest practical constraint on SSMs right now is ecosystem maturity. Fine-tuning Mamba-3 requires more effort than fine-tuning a Llama-3 equivalent. Adapter formats like LoRA are not as mature. If you need to customize the model or run complex agentic tasks with tool use, a transformer is safer.
For pure inference at long context on fixed data, SSMs are the practical choice. The cost savings are real and the deployment complexity is comparable.
For a broader GPU selection decision across use cases, see Best GPU for AI Inference in 2026. For a comparison of pricing models across on-demand, spot, and reserved, see Serverless GPU vs On-Demand vs Reserved.
Hybrid Architectures: Combining Mamba Layers with Attention for Best of Both Worlds
Pure SSMs make a quality tradeoff. For tasks requiring strong multi-step reasoning or precise information retrieval across long contexts, the fixed-size state sometimes loses information that a full attention KV cache would retain. Hybrid architectures address this.
AI21 Labs' Jamba and the Zamba model family interleave SSM layers with sparse attention layers. The pattern is typically 7-8 SSM layers followed by 1 attention layer. This gives the model selective access to the full KV cache at attention layers while handling the bulk of computation through efficient SSM layers.
GPU implications of hybrid models:
- VRAM is closer to a pure transformer than a pure SSM, because the attention layers still generate KV cache entries
- However, the KV cache grows at 1/8th to 1/10th the rate of a full transformer (only attention layers contribute)
- At 64K context, a Jamba-style hybrid uses roughly 10-15% of the KV cache a full transformer would require
- Throughput at long context falls between pure SSM and full transformer, typically 2-4x better than the transformer
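The KV cache saving follows directly from the layer ratio. A quick sketch, reusing the 7B/32K KV cache figure from the VRAM section and assuming a 1-in-8 attention layout (the exact ratio varies by hybrid model):

```python
def hybrid_kv_fraction(attention_every_n_layers=8):
    # If only 1 of every N layers is attention, only that fraction of
    # layers contributes KV cache entries relative to a full transformer.
    return 1 / attention_every_n_layers

full_kv_gb = 17.2  # full-transformer KV cache: 7B-class, 32K context, batch 4
hybrid_kv_gb = full_kv_gb * hybrid_kv_fraction(8)
print(hybrid_kv_gb)  # roughly 2 GB instead of ~17 GB
```

The hybrid's KV cache still grows linearly with context, but at one-eighth the slope, which is why its throughput and VRAM numbers sit between the pure SSM and the full transformer.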
Deployment is the same as for pure SSMs: use vLLM or SGLang with the hybrid model's HuggingFace ID. Check the model card for any specific package dependencies, as some hybrids use both mamba-ssm and standard attention kernels.
For workloads requiring better instruction following than pure SSMs but still needing linear context scaling, hybrids are the practical middle ground. Jamba also uses MoE layers in its architecture. For the MoE-specific deployment considerations, see MoE Inference Optimization on GPU Cloud.
Getting Started with Mamba-3 on Spheron GPU Cloud
Mamba-3 changes which GPU tier makes sense for your workload. Because SSMs are more compute-bound than memory-bandwidth-bound at long context, H100's lower cost gives better price-to-performance than H200 for long-context SSM inference, since H200's bandwidth premium goes unused. This is the opposite of the recommendation for long-context transformer deployments.
Right-sizing for SSM workloads specifically:
| Workload | GPU Recommendation | Notes |
|---|---|---|
| Mamba-3 7B, short context (<8K), cost-priority | L40S PCIe on-demand ($0.72/hr) | Good throughput, competitive price |
| Mamba-3 7B, long context (16K-128K) | L40S PCIe on-demand ($0.72/hr) | No KV cache pressure at any length |
| Mamba-3 34B FP8, production serving | A100 SXM4 ($1.64/hr) | Higher compute than PCIe variant |
| Mamba-3 70B+ FP8 | H100 SXM5 ($2.90/hr) | Best compute-to-cost for large SSMs |
| Batch processing at scale | H100 SXM5 spot ($0.80/hr) | Spot pricing for non-latency-sensitive jobs |
A100 performs comparatively better for SSM workloads than for transformer workloads. Transformers at long context stress memory bandwidth, where H100 and H200 outperform A100. SSMs at long context stress compute, where A100 is a closer competitor. For 34B-class SSMs at long context, A100 SXM4 is a reasonable choice before stepping up to H100 for larger models.
Quick start:
- Provision a Spheron GPU instance at app.spheron.ai. Select your GPU based on the table above.
- SSH in, verify your GPU with `nvidia-smi`, and confirm CUDA 12+ is available.
- Install: `pip install "vllm>=0.5.0" mamba-ssm causal-conv1d`
- Launch: `python -m vllm.entrypoints.openai.api_server --model <your-mamba-3-model-id> --dtype bfloat16 --max-model-len 131072 --port 8000`
- Test: `curl http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model": "<model-id>", "prompt": "Summarize:", "max_tokens": 200}'`
Deployment templates and configuration guides are available at docs.spheron.ai.
SSMs like Mamba-3 run efficiently on GPUs that are "too small" for equivalent transformers at long context. Spheron's GPU catalog gives you the flexibility to right-size for your workload, from L40S for small SSM models to H100 clusters for large-scale inference.
