Most GPU inference guides assume transformers. That assumption is starting to cost teams money.
State space models (SSMs) like Mamba-3, released by researchers from CMU, Princeton, Cartesia AI, and Together AI in March 2026, offer a different architectural tradeoff: linear-time inference whose per-token cost stays flat as context length grows, instead of climbing the way attention does. For workloads with long context windows, the GPU economics flip: a transformer that is the cheaper option at 2K context can cost severalfold more per token than the SSM equivalent at 64K.
This guide covers how to size GPUs for Mamba-3 and SSM workloads, how to deploy with vLLM and SGLang, and when to actually use SSMs instead of a transformer. We also compare cost per token at long context lengths to give you a concrete decision framework. For background on the memory bandwidth bottleneck in LLM inference, see the AI memory wall inference guide.
What Are State Space Models and Why They Challenge Transformers in 2026
Transformers have a fundamental scaling problem at long context: the KV cache. Every new token requires attending to all previous tokens, so attention cost grows quadratically with sequence length, and a 7B transformer at 128K context spends orders of magnitude more compute per sequence than at 2K. VRAM use grows linearly with sequence length via the KV cache, throughput collapses, and you end up buying more GPU just to handle longer documents.
SSMs work differently. Instead of storing every past token in a KV cache, they compress the past into a fixed-size recurrent state. The state size does not grow with sequence length. Whether you process 1K tokens or 128K tokens, the memory overhead of the recurrent state stays constant at 2-4 GB, compared to tens of gigabytes for a transformer KV cache at the same length.
This is not a new idea. RNNs and LSTMs used the same recurrent-state approach. What Mamba and its successors got right was selectivity: the state learns which information to remember and which to discard, based on the input. That selective scan mechanism (S6 in the original Mamba, building on the earlier non-selective S4 line of models, with a multi-head variant in Mamba-3) is what gives SSMs competitive quality against transformers.
In 2026, SSMs moved from research curiosity to practical production option. The Mamba-3 release from CMU, Princeton, Cartesia AI, and Together AI in March 2026, along with AI21 Labs' Jamba hybrids and other SSM variants, gave teams access to models that handle 64K-128K contexts at a fraction of the GPU cost of equivalent transformers. For background on what KV cache pressure looks like in practice, see the KV cache optimization guide.
Mamba-3 Architecture: How Linear-Time Inference Changes GPU Economics
At the architecture level, Mamba-3 replaces the multi-head attention (MHA) and KV cache mechanism with a selective state space layer. Each layer maintains a state matrix that represents compressed information from all prior tokens. On each new token, the model:
- Computes how much to update the state (the selective gate)
- Reads from the state to generate output
- Writes a new compressed representation back to the state
The state size is fixed, regardless of sequence length. That's the key property.
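The three steps above can be sketched in a few lines of NumPy. This is a toy, single-group version of a selective state update that illustrates only the fixed-size-state property; the projection names (`W_dt`, `W_B`, `W_C`), shapes, and update rule are illustrative simplifications, not Mamba-3's actual parameterization or fused CUDA kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 16                           # fixed state size: does NOT grow with seq_len
d_model = 8
A = -np.exp(rng.normal(size=d_state))  # stable (negative) per-channel decay rates

def ssm_step(h, x_t, W_dt, W_B, W_C):
    """Process one token against a fixed-size state h: gate, write, read."""
    dt = np.log1p(np.exp(W_dt @ x_t))      # selective gate: input sets update size
    A_bar = np.exp(dt * A)                 # discretized state decay
    B_t = W_B @ x_t                        # input-dependent write vector
    C_t = W_C @ x_t                        # input-dependent read vector
    h = A_bar * h + dt * B_t * x_t.mean()  # write compressed info into the state
    y_t = C_t @ h                          # read the output from the state
    return h, y_t

W_dt, W_B, W_C = (rng.normal(size=(d_state, d_model)) * 0.1 for _ in range(3))
h = np.zeros(d_state)
for _ in range(1000):                      # 1K or 128K tokens: h stays d_state-sized
    h, y = ssm_step(h, rng.normal(size=d_model), W_dt, W_B, W_C)
assert h.shape == (d_state,)
```

However many tokens the loop consumes, `h` never grows; that constant-size `h` is what replaces the transformer's per-token KV cache entries.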
Mamba-3 specifically introduced a MIMO (multi-input, multi-output) SSM design over Mamba-2, improving how information is distributed across the state matrix. It also improved hardware utilization with better CUDA kernel alignment to the tensor core requirements of H100 and A100 GPUs. In practice this means Mamba-3 runs faster per token than Mamba-2 on the same hardware.
For GPU economics, the critical implication is the bottleneck shift. Transformers at long context are memory-bandwidth-bound: the GPU spends most of its time moving KV cache data between HBM and compute units. SSMs at long context are more compute-bound: the state update operation is a matrix multiplication over a fixed-size matrix, which plays to the GPU's FLOP throughput strengths.
H100 SXM5 has ~1.98 petaFLOPS of BF16 Tensor Core compute (with structured sparsity; ~989 TFLOPS dense) and 3.9 petaFLOPS with FP8 Tensor Cores (with structured sparsity; ~1,979 TFLOPS dense), but 3.35 TB/s of HBM bandwidth. For memory-bandwidth-bound workloads (long-context transformers), H200's 4.8 TB/s bandwidth gives a real throughput advantage worth the price premium. For SSMs, which are more compute-bound at long context, H100's price-to-performance ratio improves since H200's bandwidth premium goes unused. This changes which GPU tier makes sense for your workload. For a full treatment of the memory vs compute tradeoff, see the inference engineering guide.
Note on model sizes: Published Mamba-3 research covers models at approximately 1.5B parameters. The 7B, 34B, and 70B sizes used in this guide are illustrative projections based on SSM scaling behavior, not confirmed released variants. Before deploying, check the state-spaces HuggingFace organization for currently available model IDs and sizes.
GPU and VRAM Requirements for Mamba-3 vs Equivalent Transformer Models
SSMs need less VRAM than transformers at moderate and long context lengths, and the gap widens as context grows. The formula is:

```
vram_gb = (params_billions x bytes_per_dtype x 1.07) + state_overhead_gb
```

The 1.07 overhead factor accounts for activations and runtime buffers. SSMs use a smaller overhead factor than transformers (which use ~1.15) because they have no KV cache to reserve headroom for. For SSMs, state_overhead_gb is constant at approximately 2-4 GB regardless of sequence length. For transformers, the KV cache adds:

```
kv_cache_gb = 2 x num_layers x num_kv_heads x head_dim x seq_len x batch_size x bytes_per_dtype / 1e9
```

For a 7B transformer at BF16, 32K context, batch size 4: roughly 2 x 32 x 8 x 128 x 32768 x 4 x 2 / 1e9 = ~17 GB of KV cache on top of the ~16 GB of weights, or ~33 GB total. The same 7B SSM at 32K context needs ~16 GB weights + ~3 GB state = ~19 GB total. For more detail on transformer VRAM math, see the GPU memory requirements guide.
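The two formulas can be wrapped in a small sizing helper. The transformer defaults below (32 layers, 8 KV heads, head dim 128) match the 7B worked example above and are assumptions to adjust for your model; the outputs land within a gigabyte of the rounded figures in the prose.

```python
def ssm_vram_gb(params_b, bytes_per_dtype=2, state_overhead_gb=3.0):
    # Weights x 1.07 (activations/buffers) + fixed recurrent state.
    return params_b * bytes_per_dtype * 1.07 + state_overhead_gb

def transformer_vram_gb(params_b, seq_len, batch_size, bytes_per_dtype=2,
                        num_layers=32, num_kv_heads=8, head_dim=128):
    # Weights x 1.15 + KV cache, which grows linearly with seq_len and batch.
    weights = params_b * bytes_per_dtype * 1.15
    kv = (2 * num_layers * num_kv_heads * head_dim
          * seq_len * batch_size * bytes_per_dtype) / 1e9
    return weights + kv

# 7B at BF16, 32K context, batch 4 -- the worked example above.
print(round(ssm_vram_gb(7), 1))                    # 18.0 (GB, constant in seq_len)
print(round(transformer_vram_gb(7, 32768, 4), 1))  # 33.3 (GB, and still growing)
```

Doubling `seq_len` leaves the SSM number unchanged and adds another ~17 GB to the transformer number, which is the whole story of this section in two function calls.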
Here's how the major model sizes compare across precisions:
| Model | Params | Precision | VRAM (Weights + State/KV) | Minimum GPU | Context Limit |
|---|---|---|---|---|---|
| Mamba-3 7B | 7B | BF16 | ~19 GB | L40S 48GB | 128K+ (fixed state) |
| Mamba-3 7B | 7B | FP8 | ~11 GB | L40S 48GB | 128K+ (fixed state) |
| Transformer 7B | 7B | BF16 | ~16 GB + KV cache | L40S 48GB | ~32K before pressure |
| Transformer 7B | 7B | FP8 | ~9 GB + KV cache | L40S 48GB | ~32K before pressure |
| Mamba-3 34B | 34B | BF16 | ~76 GB | A100 SXM4 80GB | 128K+ (fixed state) |
| Mamba-3 34B | 34B | FP8 | ~39 GB | A100 SXM4 80GB | 128K+ (fixed state) |
| Transformer 34B | 34B | BF16 | ~78 GB + KV cache | H100 SXM5 80GB | ~16K before pressure |
| Mamba-3 large (70B+) | 70B | FP8 | ~78 GB | H100 SXM5 80GB | 128K+ (fixed state) |
The "Context Limit" column tells the real story. Mamba-3 models have no effective context limit from VRAM growth. Transformers start hitting KV cache pressure at 16K-32K context on 80 GB GPUs, and need NVMe offloading or KV eviction to go longer. See the NVMe KV cache offloading guide for what that approach requires. SSMs skip that problem entirely.
GPU tier recommendations for Mamba-3 workloads:
| Use Case | Recommended GPU | On-Demand Price | Spot Price |
|---|---|---|---|
| Mamba-3 7B, up to 128K context | L40S PCIe | $0.72/hr | N/A |
| Mamba-3 13-34B FP8 | A100 SXM4 80GB | $1.64/hr | $0.45/hr |
| Mamba-3 34B BF16 or 70B FP8 | H100 SXM5 80GB | $2.90/hr | $0.80/hr |
| Multi-instance or batch throughput | H100 SXM5, multi-GPU | $2.90/hr per GPU | $0.80/hr per GPU |
Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Deploying Mamba-3 on GPU Cloud with vLLM and SGLang
Prerequisites
Before deploying, you need:
- A Spheron GPU instance (provision at app.spheron.ai)
- Ubuntu 22.04 with NVIDIA drivers 535+
- CUDA 12.x
- Python 3.10+
- Packages: `vllm>=0.5.0`, `mamba-ssm`, `causal-conv1d`
The mamba-ssm and causal-conv1d packages provide the custom CUDA kernels that implement the selective state space computation. Without them, vLLM falls back to a slower reference implementation. Both require CUDA to compile, so the NVIDIA toolkit must be installed before pip install.
Single-GPU Deployment with vLLM
Docker (recommended):
```bash
docker run --gpus all \
  --ipc=host \
  --rm \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Bare metal:
```bash
pip install "vllm>=0.5.0" mamba-ssm causal-conv1d
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

For FP8 quantization (halves VRAM, small quality tradeoff):
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --quantization fp8 \
  --max-model-len 131072 \
  --port 8000
```

Note: vLLM enables FP8 via the `--quantization fp8` flag (`--dtype` does not accept fp8). Also verify the exact HuggingFace model path before running. The state-spaces HuggingFace organization hosts the original Mamba series. Check there for the current Mamba model IDs available for deployment.
Multi-GPU Tensor Parallelism
SSMs have a different tensor parallelism profile than transformers. Transformers split KV cache across GPUs in addition to weights, so tensor parallelism helps with both VRAM and memory bandwidth. SSMs only split weights, because the state fits in a single GPU's VRAM at any context length.
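A rough per-GPU sizing sketch for SSM tensor parallelism follows from that difference. This assumes the recurrent state and its buffers are budgeted on each GPU rather than sharded, which is a simplifying assumption; actual per-GPU usage depends on the serving framework.

```python
def ssm_tp_vram_per_gpu_gb(params_b, tp, bytes_per_dtype=2, state_overhead_gb=3.0):
    # Weights shard evenly across the tensor-parallel group; the fixed
    # recurrent state is small, so we conservatively budget it per GPU.
    weights_per_gpu = params_b * bytes_per_dtype * 1.07 / tp
    return weights_per_gpu + state_overhead_gb

# Mamba-3 34B BF16: ~76 GB on one GPU, ~39 GB per GPU at TP=2
print(round(ssm_tp_vram_per_gpu_gb(34, 1)))  # 76
print(round(ssm_tp_vram_per_gpu_gb(34, 2)))  # 39
```

This is why 34B BF16 fits comfortably on 2x A100 80GB: tensor parallelism only needs to split weights, with no growing KV cache share to plan for.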
For Mamba-3 34B BF16 on 2x A100 80GB:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-34b-model-id> \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```

For 4x GPU:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-34b-model-id> \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --port 8000
```

On Spheron, SXM-variant GPUs use NVLink (900 GB/s bidirectional), which makes tensor parallelism efficient. PCIe Gen5-connected GPUs have up to 128 GB/s of bidirectional bandwidth (Gen4 GPUs like the A100 PCIe and L40S have 64 GB/s), so they see higher inter-GPU communication overhead. For SSMs serving throughput-heavy workloads, prefer SXM variants when running more than 2-GPU tensor parallelism. For a full vLLM production setup, see the vLLM production deployment guide.
SGLang Alternative
SGLang supports Mamba-series models with similar commands. SGLang's runtime scheduler has different batching behavior that can be beneficial for SSMs at mixed batch sizes:
```bash
pip install "sglang[all]" mamba-ssm causal-conv1d
python -m sglang.launch_server \
  --model-path <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --context-length 131072 \
  --port 8000
```

For a full SGLang production setup including load balancing and monitoring, see the SGLang production deployment guide.
Long Context Configuration
SSMs can run at 64K-128K+ context lengths without any special configuration because their memory use does not grow with sequence length. This is a direct contrast to transformer deployments, where you need KV cache eviction, CPU offloading, or NVMe offloading to reach those lengths on most GPUs.
To configure Mamba-3 for maximum context:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model <your-mamba-3-model-id> \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

You can set `--gpu-memory-utilization` higher for SSMs than you would for transformers, because there is no KV cache that grows unboundedly. The main VRAM consumers are the model weights plus a small fixed state buffer.
Benchmark Comparison: SSM vs Transformer Throughput, Latency, and Cost Per Token
These benchmarks are representative estimates based on published SSM research and scaling behavior. Run your own benchmarks on your target hardware before production decisions.
Table 1: Throughput at varying context lengths (tokens/second, Mamba-3 7B vs Llama-3 7B, single H100 SXM5)
| Context Length | Mamba-3 7B (H100) | Llama-3 7B (H100) | SSM Advantage |
|---|---|---|---|
| 2K tokens | ~2,800 tok/s | ~3,100 tok/s | ~0.9x (transformer faster) |
| 8K tokens | ~2,750 tok/s | ~2,400 tok/s | ~1.1x |
| 16K tokens | ~2,700 tok/s | ~1,100 tok/s | ~2.5x |
| 64K tokens | ~2,600 tok/s | ~350 tok/s | ~7x |
| 128K tokens | ~2,500 tok/s | ~120 tok/s | ~20x |
The crossover is around 8K tokens. Below that, transformers are competitive or faster. Above 16K, SSMs pull ahead and the gap compounds at longer lengths. For workloads where the average sequence is under 4K tokens, a transformer is likely the right choice.
Table 2: Cost per million tokens on Spheron GPUs
| Model | GPU | Price/hr | Throughput (8K avg) | Cost/M tokens |
|---|---|---|---|---|
| Mamba-3 7B BF16 | L40S PCIe (on-demand) | $0.72/hr | ~1,800 tok/s | ~$0.11/M |
| Mamba-3 34B FP8 | A100 SXM4 (on-demand) | $1.64/hr | ~800 tok/s | ~$0.57/M |
| Llama-3 7B BF16 | L40S PCIe (on-demand) | $0.72/hr | ~1,200 tok/s (at 8K) | ~$0.17/M |
| Llama-3 34B FP8 | A100 SXM4 (on-demand) | $1.64/hr | ~400 tok/s (at 8K) | ~$1.14/M |
At 8K context, Mamba-3 7B on-demand is roughly 1.5x cheaper per token than an equivalent transformer at the same GPU tier. At 64K context, the gap is closer to ~7x.
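The cost-per-token arithmetic behind the table is simple enough to reuse with your own measurements. The throughput figures below come from the table above and are representative estimates, not benchmarks:

```python
def cost_per_million_tokens(price_per_hr, tokens_per_sec):
    # tokens per hour = tok/s * 3600; scale the per-token cost to 1M tokens.
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

# L40S on-demand at $0.72/hr, 8K-context throughput estimates from the table
print(round(cost_per_million_tokens(0.72, 1800), 2))  # 0.11  Mamba-3 7B
print(round(cost_per_million_tokens(0.72, 1200), 2))  # 0.17  Llama-3 7B
```

Because GPU-hour price is fixed, cost per token is just the inverse of sustained throughput, which is why the SSM advantage in Table 1 translates directly into the cost gap here.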
Pricing fluctuates based on GPU availability. The prices above are based on 18 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For a full look at inference cost optimization techniques, see the AI inference cost economics guide.
When to Use SSMs vs Transformers: Decision Framework for Production Workloads
The core tradeoff is simple: SSMs win at long context and lose at ecosystem maturity. Transformers win at short context and have dramatically better tooling.
| Criterion | Use SSM (Mamba-3) | Use Transformer |
|---|---|---|
| Typical context length | Over 16K tokens | Under 4K tokens |
| VRAM budget | Constrained (under 80 GB) | Flexible |
| Primary workload | Document analysis, summarization | Short-form generation, chat |
| Throughput priority | Long-document throughput | Concurrent short requests |
| Ecosystem requirement | Flexible | Strict (fine-tuning tooling, adapters) |
| Instruction following | Simple tasks | Complex multi-step reasoning |
| Fine-tuning | Limited options | Full ecosystem (LoRA, PEFT, Axolotl) |
| Serving framework | vLLM 0.5+, SGLang (limited) | Full ecosystem |
The biggest practical constraint on SSMs right now is ecosystem maturity. Fine-tuning Mamba-3 requires more effort than fine-tuning a Llama-3 equivalent. Adapter formats like LoRA are not as mature. If you need to customize the model or run complex agentic tasks with tool use, a transformer is safer.
For pure inference at long context on fixed data, SSMs are the practical choice. The cost savings are real and the deployment complexity is comparable.
For a broader GPU selection decision across use cases, see Best GPU for AI Inference in 2026. For a comparison of pricing models across on-demand, spot, and reserved, see Serverless GPU vs On-Demand vs Reserved.
Hybrid Architectures: Combining Mamba Layers with Attention for Best of Both Worlds
Pure SSMs make a quality tradeoff. For tasks requiring strong multi-step reasoning or precise information retrieval across long contexts, the fixed-size state sometimes loses information that a full attention KV cache would retain. Hybrid architectures address this.
AI21 Labs' Jamba and the Zamba model family interleave SSM layers with sparse attention layers. The pattern is typically 7-8 SSM layers followed by 1 attention layer. This gives the model selective access to the full KV cache at attention layers while handling the bulk of computation through efficient SSM layers.
GPU implications of hybrid models:
- VRAM is closer to a pure transformer than a pure SSM, because the attention layers still generate KV cache entries
- However, the KV cache grows at 1/8th to 1/10th the rate of a full transformer (only attention layers contribute)
- At 64K context, a Jamba-style hybrid uses roughly 10-15% of the KV cache a full transformer would require
- Throughput at long context falls between pure SSM and full transformer, typically 2-4x better than the transformer
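The KV cache saving follows directly from the layer ratio. A quick sketch, reusing the 7B/32K KV cache figure from the VRAM section and assuming a 1-in-8 attention layout (the exact ratio varies by hybrid model):

```python
def hybrid_kv_fraction(attention_every_n_layers=8):
    # If only 1 of every N layers is attention, only that fraction of
    # layers contributes KV cache entries relative to a full transformer.
    return 1 / attention_every_n_layers

full_kv_gb = 17.2  # full-transformer KV cache: 7B-class, 32K context, batch 4
hybrid_kv_gb = full_kv_gb * hybrid_kv_fraction(8)
print(hybrid_kv_gb)  # roughly 2 GB instead of ~17 GB
```

The hybrid's KV cache still grows linearly with context, but at one-eighth the slope, which is why its throughput and VRAM numbers sit between the pure SSM and the full transformer.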
Deployment is the same as for pure SSMs: use vLLM or SGLang with the hybrid model's HuggingFace ID. Check the model card for any specific package dependencies, as some hybrids use both mamba-ssm and standard attention kernels.
For workloads requiring better instruction following than pure SSMs but still needing linear context scaling, hybrids are the practical middle ground. Jamba also uses MoE layers in its architecture. For the MoE-specific deployment considerations, see MoE Inference Optimization on GPU Cloud.
Getting Started with Mamba-3 on Spheron GPU Cloud
Mamba-3 changes which GPU tier makes sense for your workload. Because SSMs are more compute-bound than memory-bandwidth-bound at long context, H100's lower cost gives better price-to-performance than H200 for long-context SSM inference, since H200's bandwidth premium goes unused. This is the opposite of the recommendation for long-context transformer deployments.
Right-sizing for SSM workloads specifically:
| Workload | GPU Recommendation | Notes |
|---|---|---|
| Mamba-3 7B, short context (<8K), cost-priority | L40S PCIe on-demand ($0.72/hr) | Good throughput, competitive price |
| Mamba-3 7B, long context (16K-128K) | L40S PCIe on-demand ($0.72/hr) | No KV cache pressure at any length |
| Mamba-3 34B FP8, production serving | A100 SXM4 ($1.64/hr) | Higher compute than PCIe variant |
| Mamba-3 70B+ FP8 | H100 SXM5 ($2.90/hr) | Best compute-to-cost for large SSMs |
| Batch processing at scale | H100 SXM5 spot ($0.80/hr) | Spot pricing for non-latency-sensitive jobs |
A100 performs comparatively better for SSM workloads than for transformer workloads. Transformers at long context stress memory bandwidth, where H100 and H200 outperform A100. SSMs at long context stress compute, where A100 is a closer competitor. For 34B-class SSMs at long context, A100 SXM4 is a reasonable choice before stepping up to H100 for larger models.
Quick start:
- Provision a Spheron GPU instance at app.spheron.ai. Select your GPU based on the table above.
- SSH in, verify your GPU with `nvidia-smi`, and confirm CUDA 12+ is available.
- Install: `pip install "vllm>=0.5.0" mamba-ssm causal-conv1d`
- Launch: `python -m vllm.entrypoints.openai.api_server --model <your-mamba-3-model-id> --dtype bfloat16 --max-model-len 131072 --port 8000`
- Test: `curl http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model": "<model-id>", "prompt": "Summarize:", "max_tokens": 200}'`
Deployment templates and configuration guides are available at docs.spheron.ai.
SSMs like Mamba-3 run efficiently on GPUs that are "too small" for equivalent transformers at long context. Spheron's GPU catalog gives you the flexibility to right-size for your workload, from L40S for small SSM models to H100 clusters for large-scale inference.
