Deploy xLSTM and RWKV-7 on GPU Cloud: Linear-Attention Alternatives for Million-Token Context Inference (2026)

Most inference guides assume transformers. For long-context workloads, that assumption is becoming expensive.

State space models like Mamba-3 showed that recurrent-state architectures can outperform transformers on million-token context tasks at a fraction of the GPU cost. Two other architectures released in 2026 belong in that same conversation: xLSTM 7B from NXAI and RWKV-7 World v3 from BlinkDL. Both use linear-attention mechanisms, both maintain fixed-size recurrent state that does not grow with context length, and both have production-ready inference runtimes that work today. Our Mamba-3 GPU deployment guide covers the SSM baseline; this post covers the two highest-profile linear-attention additions of 2026. Liquid AI's LFM family is a third non-transformer architecture worth evaluating alongside xLSTM and RWKV-7 - see the Liquid Foundation Model deployment guide for the production setup.

A fourth architecture that belongs in this conversation is TTT (Test-Time Training), which takes the fixed-state idea one step further: the hidden state is itself a small gradient-updated model. See the TTT deployment guide for how TTT-Linear compares to xLSTM and Mamba on long-context GPU inference. A fifth worth tracking is SubQ: unlike xLSTM and RWKV-7, which use fixed-size recurrent state, SubQ retains a growing KV cache but makes it linear rather than quadratic. The SubQ 1M-Preview deployment guide covers how that distinction plays out in VRAM sizing and TTFT at 12M token context. A sixth architecture emerging in mid-2026 is log-linear attention, which achieves O(N log N) scaling via hierarchical state - see the log-linear attention inference guide for a comparison against xLSTM and Mamba-3.

This guide walks through VRAM sizing, runtime setup with the transformers + xlstm package and rwkv.cpp, throughput benchmarks at multiple context lengths, and a cost-per-million-token comparison against Llama 3.3 70B. For context on the memory bandwidth bottleneck that makes these architectures attractive, see the AI memory wall inference guide.

Why Linear-Attention Architectures Matter in 2026

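The transformer KV cache has a fundamental scaling problem. Every token in a sequence adds key-value pairs at every layer, and all of them must stay in VRAM for the rest of the generation. The cache therefore grows linearly with sequence length and with batch size, while the attention compute over it grows quadratically with sequence length.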

A 7B transformer at BF16, processing a 128K-token sequence at batch size 4, generates a KV cache of approximately:

kv_cache_gb = 2 x 32 layers x 8 kv_heads x 128 head_dim x 131072 seq_len x 4 batch x 2 bytes / 1e9
            ≈ 69 GB

That is ~69 GB of KV cache on top of the ~15 GB of model weights, totalling ~84 GB. A single H100 SXM5 with 80 GB of HBM cannot fit this. You need KV cache eviction, NVMe offloading, or a fundamentally different architecture. For the transformer-side mitigations, see the KV cache optimization guide.
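The arithmetic above is easy to re-run for other shapes. The sketch below wraps the same formula in a small function; the function name and defaults are illustrative, and the layer, head, and dimension values are the ones used in the example above rather than those of any specific checkpoint.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size in GB (1e9 bytes) for a decoder-only transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total_bytes / 1e9


if __name__ == "__main__":
    # Same shape as the worked example: 32 layers, 8 KV heads, head_dim 128, batch 4.
    for seq_len in (2_048, 16_384, 131_072):
        gb = kv_cache_gb(32, 8, 128, seq_len, batch=4)
        print(f"{seq_len:>7} tokens -> {gb:6.1f} GB")
```

Doubling seq_len or batch doubles the result, which is the linear growth that a fixed-size recurrent state avoids.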

Linear-attention architectures take the alternative path. Instead of storing every past token in a cache, they compress the sequence history into a fixed-size recurrent state. For xLSTM, this is a matrix memory cell with exponential gating. For RWKV-7, it is a time-mixing mechanism with linear attention and a token-shift operation. The recurrent state for a 7B model stays in the low single-digit GB range at most, regardless of whether you have processed 2K tokens or 2M tokens.

The GPU economics flip at long context. A model that needs 6x the compute of a transformer at 2K context may need half the compute at 64K, because the transformer is drowning under its KV cache while the linear-attention model operates at constant memory overhead.

xLSTM vs RWKV-7 vs Mamba-3: Architecture Comparison

All three are linear-attention architectures in the sense that their inference complexity scales linearly with sequence length, not quadratically. Beyond that, they differ significantly in how they implement the recurrent state.

Property	xLSTM 7B	RWKV-7 World v3	Mamba-3
Architecture type	Extended LSTM with matrix memory	Linear attention with time-mixing	Selective state space (SSM)
State size (7B BF16)	low single-digit GB (fixed)	low single-digit GB (fixed)	low single-digit GB (fixed)
Context scaling	O(1) VRAM, O(n) compute	O(1) VRAM, O(n) compute	O(1) VRAM, O(n) compute
VRAM overhead at 128K context	Fixed (no growth)	Fixed (no growth)	Fixed (no growth)
Primary 2026 release	NX-AI/xLSTM-7b	BlinkDL/rwkv-7-world	state-spaces/mamba-3
Primary runtime	transformers + xlstm	rwkv.cpp, ChatRWKV	vLLM 0.5+
vLLM support	Not supported (May 2026)	Limited	Full support
Framework	PyTorch (transformers + xlstm)	C++ GGML + Python	PyTorch

xLSTM: Matrix Memory Cells and Exponential Gating

xLSTM extends the classic LSTM by replacing the scalar cell state with a matrix memory. Each layer maintains a matrix C_t that stores compressed representations of past tokens. On each new token, the model computes an exponential gate that controls how much new information updates the matrix versus how much old information is retained.

The matrix structure gives xLSTM more representational capacity than classic LSTM or GRU variants. The exponential gating (an exponential activation on the input gate in place of a sigmoid, paired with a running-max stabilizer state that keeps the exponent bounded) stays numerically stable in BF16 and gives the model finer control over information retention across very long sequences.
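To make the matrix-memory idea concrete, here is a minimal single-head sketch of one recurrent update in plain Python. It follows the general shape of the mLSTM update described in the xLSTM literature (outer-product write, normalizer vector, log-space stabilizer), but it is an illustration under simplifying assumptions, not the kernel the xlstm package runs; the function name and argument layout are hypothetical.

```python
import math


def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One recurrent matrix-memory update for a single head.

    C: d x d matrix memory (list of rows), n: length-d normalizer,
    m: scalar stabilizer, q/k/v: length-d vectors,
    i_pre/f_pre: scalar gate pre-activations.
    """
    d = len(q)
    k = [x / math.sqrt(d) for x in k]        # scaled key
    log_f = -math.log1p(math.exp(-f_pre))    # log(sigmoid(f_pre))
    m_new = max(log_f + m, i_pre)            # running max keeps exp() bounded
    i_gate = math.exp(i_pre - m_new)         # stabilized exponential input gate
    f_gate = math.exp(log_f + m - m_new)     # stabilized forget gate
    # Decay the old memory, then write the outer product v k^T.
    C = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)] for r in range(d)]
    n = [f_gate * n[c] + i_gate * k[c] for c in range(d)]
    # Read out with the query and normalize.
    num = [sum(C[r][c] * q[c] for c in range(d)) for r in range(d)]
    den = max(abs(sum(n[c] * q[c] for c in range(d))), 1.0)
    h = [x / den for x in num]
    return C, n, m_new, h


if __name__ == "__main__":
    d = 4
    C, n, m = [[0.0] * d for _ in range(d)], [0.0] * d, 0.0
    for t in range(1000):
        C, n, m, h = mlstm_step(C, n, m, [1.0] * d, [0.5] * d, [0.01 * t] * d, 1.0, 2.0)
    print("state elements after 1000 tokens:", d * d + d + 1)
```

The state passed in and out has the same shape on every step, which is why memory does not grow with the number of tokens processed.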

The inference runtime is the HuggingFace transformers library combined with the xlstm package, which provides Triton-based GPU kernels. This avoids JAX/XLA compilation overhead and loads at the same speed as any other transformers model.

RWKV-7: Time-Mixing with Linear Attention

RWKV-7 uses a different mechanism called time-mixing. Each layer applies a learned time-decay to the recurrent state, controlling how quickly past information fades. The token-shift mechanism adds a mix of the previous token's representation into the current token's computation, giving the model a two-step temporal context without full attention.
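The two ideas in this paragraph can be sketched in a few lines. The code below is a deliberately simplified illustration: token shift as a per-channel interpolation between the previous and current token, and a per-channel decay applied to a fixed-size state before new information is written. RWKV-7's actual state update contains additional terms that are not shown here, and the function names are hypothetical.

```python
def token_shift(prev_x, cur_x, mu):
    """Blend the previous token's features into the current token's, per channel.

    mu[i] = 0 keeps only the current token; mu[i] = 1 keeps only the previous one.
    """
    return [c + m * (p - c) for p, c, m in zip(prev_x, cur_x, mu)]


def decay_update(state, decay, k, v):
    """state[i][j] <- decay[j] * state[i][j] + v[i] * k[j].

    decay[j] in (0, 1) controls how quickly key channel j forgets.
    The state keeps the same shape no matter how many tokens are processed.
    """
    return [
        [decay[j] * state[i][j] + v[i] * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


if __name__ == "__main__":
    mixed = token_shift([1.0, 1.0], [0.0, 0.0], [0.25, 1.0])
    state = decay_update([[1.0, 1.0]], [0.5, 0.25], [1.0, 0.0], [2.0])
    print(mixed, state)
```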

RWKV-7 World v3 is the production-ready variant, trained on a multilingual corpus with strong instruction following. The runtime is rwkv.cpp (for quantized inference) or ChatRWKV (for the Python serving stack). rwkv.cpp is the simpler of the two to deploy for quantized inference because it uses GGML kernels with tight memory control.

The key production advantage of RWKV-7 is stateful multi-turn serving: the model's recurrent state after processing a user's message can be saved and resumed on the next turn. A transformer serving system with KV cache does something similar, but transformer KV caches are proportional to context length. RWKV-7's state is fixed-size regardless of how many turns have passed.

For the broader Mamba-3 architecture and SSM background, see the Mamba-3 deployment guide.

Hardware Sizing: VRAM, Recurrent State, and GPU Selection

The VRAM formula for linear-attention models follows the same pattern as Mamba-3 but with different state overhead:

vram_gb = (params_billions x bytes_per_dtype x 1.07) + state_gb

For xLSTM 7B at BF16: (7 x 2 x 1.07) + state_gb ≈ 15-18 GB (state is low single-digit GB)

For RWKV-7 7B at BF16: (7 x 2 x 1.07) + state_gb ≈ 15-17 GB (state is low single-digit GB)

The 1.07 overhead factor covers activations and runtime buffers. Linear-attention models use a smaller buffer overhead than transformers (which use ~1.15) because there is no KV cache to reserve headroom for. State size is constant regardless of sequence length.
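The sizing formula can be written as a one-line helper. This is a sketch of the formula above; the function name is illustrative and the state_gb values in the loop are assumed sample points within the "low single-digit GB" range, not measurements.

```python
def linear_attention_vram_gb(params_billions, bytes_per_param, state_gb, overhead=1.07):
    """Weights scaled by a runtime overhead factor, plus a fixed recurrent state."""
    return params_billions * bytes_per_param * overhead + state_gb


if __name__ == "__main__":
    # 7B parameters at BF16 (2 bytes per parameter), with assumed state sizes.
    for state_gb in (1.0, 2.0, 3.0):
        total = linear_attention_vram_gb(7, 2, state_gb)
        print(f"state {state_gb:.0f} GB -> total {total:.1f} GB")
```

Note that sequence length does not appear as an argument: that is the whole point of the fixed-size state.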

For comparison, a transformer 7B at BF16, processing a 32K context at batch size 4, needs approximately 16 GB weights plus 17 GB KV cache (the same formula as above with seq_len = 32768), for ~33 GB total. For more on GPU VRAM math for standard LLMs, see the GPU memory requirements guide.

Model	Params	Precision	VRAM (Weights + State)	Minimum GPU	Context Limit
xLSTM 7B	7B	BF16	~18 GB	L40S 48 GB	Unlimited (fixed state)
xLSTM 7B	7B	FP8	~10 GB	L40S 48 GB	Unlimited (fixed state)
RWKV-7 7B	7B	BF16	~17 GB	L40S 48 GB	Unlimited (fixed state)
RWKV-7 7B	7B	INT8	~9 GB	L40S 48 GB	Unlimited (fixed state)
Llama 3.1 8B	8B	BF16	~16 GB + KV cache	L40S 48 GB	~32K before pressure

The "Unlimited" context limit for linear-attention models is literal: VRAM does not grow with sequence length. The same L40S that handles a 2K conversation handles a 128K document analysis without any configuration change. Compare that to the transformer 7B, which needs KV cache management strategies above 32K on a 48 GB GPU.

GPU tier recommendations:

Use Case	Recommended GPU	On-Demand Price	Spot Price
xLSTM 7B or RWKV-7 7B, dev/test	L40S PCIe	$0.72/hr	N/A
xLSTM 7B or RWKV-7 7B, production	A100 SXM4 80 GB	$1.70/hr	N/A
High-throughput serving, batch 8+	H100 SXM5 on Spheron	$3.10/hr	N/A
Mixed transformer + linear-attention fleet	H200 GPU rental	$2.51/hr	$1.19/hr

Pricing fluctuates based on GPU availability. The prices above are based on 06 May 2026 and may have changed. Check current GPU pricing for live rates.

The H100 vs H200 tradeoff for these workloads mirrors the Mamba-3 case. Both xLSTM and RWKV-7 are more compute-bound than memory-bandwidth-bound at long context, because the state update is a matrix multiplication over a fixed-size buffer rather than a streaming read of a large KV cache. H200's 4.8 TB/s bandwidth premium over H100's 3.35 TB/s matters less here. For pure linear-attention serving, H100 gives better price-to-compute than H200. H200 becomes the right pick if you are running a mixed fleet that also includes long-context transformer workloads, where H200's bandwidth advantage directly translates to throughput gains.

Deploying xLSTM 7B with Transformers on Spheron GPU Cloud

Prerequisites

You need:

A Spheron GPU instance (provision at app.spheron.ai)
Ubuntu 22.04 with NVIDIA drivers 535+
CUDA 12.x
Python 3.10+

Install

The recommended inference path for xLSTM 7B is the HuggingFace transformers library combined with the xlstm package, which provides the Triton kernels used during inference.

bash

pip install xlstm accelerate transformers

Verify the GPU is visible to PyTorch:

bash

python -c "import torch; print(torch.cuda.get_device_name(0))"

Download Model Weights

Important: Verify the current HuggingFace repository path before pulling. The NXAI organization on HuggingFace hosts the official xLSTM checkpoints. The repository ID is case-sensitive: use NX-AI/xLSTM-7b (lowercase b).

bash

# Repo ID is case-sensitive: NX-AI/xLSTM-7b (lowercase b)
huggingface-cli download NX-AI/xLSTM-7b \
  --local-dir ./xlstm-7b

Verify the download:

bash

ls -la ./xlstm-7b/
sha256sum ./xlstm-7b/*.safetensors  # compare against model card checksums

Single-GPU Inference

Load the model via the transformers AutoModelForCausalLM interface:

python

# infer_xlstm.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "Summarize the key differences between xLSTM and standard transformers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Run it with:

bash

python infer_xlstm.py

OpenAI-Compatible Serving

xLSTM does not have a built-in server binary. For production serving with an OpenAI-compatible endpoint, wrap the transformers model in a FastAPI app:

python

# serve_xlstm.py
import torch
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/v1/completions")
def complete(req: CompletionRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=req.max_tokens)
    text = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"choices": [{"text": text}]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)

Install dependencies and launch:

bash

pip install fastapi uvicorn
python serve_xlstm.py

Multi-GPU with Device Map

For 2-GPU deployments, set device_map="auto" and let accelerate split the model layers across GPUs:

python

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across all visible GPUs
)

The weight split covers model parameters only; the recurrent state is per-request and small regardless of how many GPUs are in use.

Long-Context Advantage

Unlike transformer deployments where long context forces large KV cache allocations (~69+ GB at 128K context, batch 4), xLSTM's memory budget stays fixed regardless of sequence length. The only VRAM consumers are the model weights and a bounded recurrent state buffer. You can process a 2K prompt and a 128K document with the same GPU allocation.

Test Request

bash

curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "xlstm-7b", "prompt": "Summarize the key differences between xLSTM and standard transformers:", "max_tokens": 300}'
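The same request can be sent from Python with only the standard library. This sketch targets the /v1/completions route and the response shape of the FastAPI wrapper above; the helper names are illustrative, and the base URL assumes the server is running locally on port 8001.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=300):
    """Build a POST request for the /v1/completions route of the wrapper."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of the wrapper's JSON response."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(base_url, prompt, max_tokens=300, timeout=120):
    """Send the request and return the generated text."""
    req = build_completion_request(base_url, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_text(resp.read())
```

With the server running, complete("http://localhost:8001", "Summarize the key differences between xLSTM and standard transformers:") returns the generated text. The generous timeout is intentional, since long prompts can take a while to prefill.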

Docker Variant

Package the serve_xlstm.py script above into a container:

dockerfile

FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
RUN pip install xlstm accelerate transformers fastapi uvicorn
COPY serve_xlstm.py /app/serve_xlstm.py
WORKDIR /app
CMD ["python", "serve_xlstm.py"]

bash

docker build -t xlstm-server .
docker run --gpus all --ipc=host --rm \
  -v $(pwd)/xlstm-7b:/app/xlstm-7b \
  -p 8001:8001 \
  xlstm-server

Deploying RWKV-7 World v3 with rwkv.cpp and ChatRWKV

Build rwkv.cpp from Source

bash

git clone https://github.com/RWKV/rwkv.cpp
cd rwkv.cpp
mkdir build && cd build

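# Build with CUDA support. The option name depends on the rwkv.cpp release:
# if -DGGML_CUDA=ON is not recognized, check the repository README
# (it has documented -DRWKV_CUBLAS=ON for cuBLAS builds).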
cmake .. -DGGML_CUDA=ON
make -j$(nproc)

This produces ./bin/rwkv and the Python shared library. CUDA 12+ and CMake 3.17+ are required. The build takes 3-8 minutes on a fresh instance.

Download Model Weights

bash

# Verify current repo path at https://huggingface.co/BlinkDL/rwkv-7-world
huggingface-cli download BlinkDL/rwkv-7-world \
  --local-dir ./rwkv7-world

Check the model card for the exact filename of the 7B BF16 checkpoint. Filenames in RWKV releases follow a pattern like RWKV-7-World-v3-7B-bf16.pth, but verify before using.

Quantize for Production

BF16 native weights give the best quality. For memory-constrained dev environments, quantize to INT8:

bash

# Only use INT8 for dev/test. Use BF16 for production.
./bin/rwkv quantize \
  ./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth \
  ./rwkv7-world-q8.bin \
  q8_0

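Note: rwkv.cpp works on GGML-format files, so a .pth checkpoint generally has to be converted before it can be quantized; check the repository README for the exact conversion and quantization commands for your release. The resulting q8_0 file is for rwkv.cpp only: ChatRWKV and the rwkv Python package load .pth checkpoints, not GGML files. If you see loading errors with the quantized file, fall back to the native BF16 weights. Do not use INT8 quantization for production serving; the quality tradeoff and potential incompatibilities are not worth the VRAM savings on a 48 GB L40S where the 17 GB model fits comfortably.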

Launch the ChatRWKV Demo Server

ChatRWKV's main files are demo scripts (API_DEMO.py, API_DEMO_CHAT.py, API_DEMO_WORLD.py, chat.py). For a quick test, use API_DEMO_CHAT.py:

bash

# Install ChatRWKV
git clone https://github.com/BlinkDL/ChatRWKV
cd ChatRWKV
pip install -r requirements.txt

# Run the chat demo (edit the model path inside the script)
python API_DEMO_CHAT.py

For a production OpenAI-compatible RWKV-7 endpoint, use the community ai00_rwkv_server project, which wraps RWKV inference in a proper HTTP server with /v1/chat/completions support.

Stateful Multi-Turn Conversations

This is RWKV-7's main production advantage over stateless transformer serving. The model's recurrent state after each user turn is a small tensor that you can save and resume using the rwkv Python package directly:

python

import os

# Recent releases of the rwkv package select the RWKV-7 code path through this
# environment variable, which must be set before the import below (an assumption
# based on the ChatRWKV demo scripts; confirm for your installed version).
os.environ.setdefault("RWKV_V7_ON", "1")

import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model="./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth", strategy="cuda bf16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# First turn: state starts as None
tokens = pipeline.encode("Hello, how are you?")
out, state = model.forward(tokens, state=None)

# Save state to disk after the turn
os.makedirs("./sessions", exist_ok=True)
torch.save(state, "./sessions/user_123.bin")

# Resume from saved state on next request (map_location remaps tensors to the
# current device, so this works correctly after spot-instance preemption)
state = torch.load("./sessions/user_123.bin", map_location=torch.device("cuda"))
tokens = pipeline.encode("What did I just ask you?")
out, state = model.forward(tokens, state=state)

The state is a list of CUDA tensors, typically 50-200 MB depending on model size. For multi-user deployments, maintain a session store mapping user IDs to state file paths. This gives RWKV-7 a genuine stateful-chat advantage: the model retains context across sessions without any KV cache reconstruction overhead.
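A session store along these lines could look like the sketch below. It is a minimal illustration, not part of ChatRWKV or the rwkv package: the SessionStore class and its methods are hypothetical, user IDs are hashed so they cannot escape the session directory, and writes go through a temporary file plus os.replace so a preemption mid-write does not leave a truncated state file. The serializer is injectable; pickle is the default here only so the example runs without a GPU.

```python
import hashlib
import os
import pickle
import tempfile


def _pickle_save(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)


def _pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


class SessionStore:
    """Maps user IDs to recurrent-state files under one directory."""

    def __init__(self, root, save_fn=None, load_fn=None):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self._save = save_fn or _pickle_save
        self._load = load_fn or _pickle_load

    def path_for(self, user_id):
        # Hash the ID so arbitrary strings map to safe, fixed-length filenames.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return os.path.join(self.root, f"{digest}.state")

    def save(self, user_id, state):
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        os.close(fd)
        try:
            self._save(state, tmp_path)
            os.replace(tmp_path, self.path_for(user_id))  # atomic on one filesystem
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise

    def load(self, user_id):
        """Return the saved state, or None for a user with no session yet."""
        path = self.path_for(user_id)
        return self._load(path) if os.path.exists(path) else None
```

To use it with real RWKV state, pass save_fn=torch.save and load_fn=lambda p: torch.load(p, map_location="cuda"). Returning None for unknown users lines up with the state=None convention for a first turn in the example above.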

Note on rwkv7-g1: The BlinkDL/rwkv-7-world model card now also points to BlinkDL/rwkv7-g1 as the recommended upgrade ("fully compatible and better in all aspects"). If you are starting a new deployment, check both repos and prefer rwkv7-g1 if it fits your context length and quantization needs.

Test Request

bash

curl http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "rwkv-7-world", "messages": [{"role": "user", "content": "What are the architectural differences between RWKV and a standard transformer?"}], "max_tokens": 300}'

Throughput, TTFT, and Cost Per Million Tokens vs Llama 3.3 70B

These figures are estimates based on published linear-attention scaling behavior (xLSTM 7B paper, arXiv:2503.13427) and vLLM baseline measurements. Run your own benchmarks on your actual hardware before making capacity decisions.

A note on model size: xLSTM 7B and RWKV-7 7B are compared against Llama 3.3 70B in the cost table below. These are not equivalent parameter counts. The comparison is cost-tier: all three are priced on H100 SXM5 hardware, and the table shows the cost-per-token tradeoff at that GPU tier. Keep in mind that Llama 3.3 70B needs roughly 140 GB for weights alone at BF16 (70B x 2 bytes), so it does not fit on one 80 GB H100 without quantization or a second GPU. Llama 3.3 70B is used as the transformer baseline because it represents a common production choice on H100.

Table 1: Throughput at varying context lengths (tokens/sec, single H100 SXM5)

Context Length	xLSTM 7B	RWKV-7 7B	Llama 3.3 70B	Note
2K tokens	~2,400 tok/s	~2,300 tok/s	~600 tok/s	Gap reflects model size (7B vs 70B); a same-size transformer would be faster here
8K tokens	~2,350 tok/s	~2,250 tok/s	~380 tok/s	Linear-attention advantage growing
16K tokens	~2,300 tok/s	~2,200 tok/s	~160 tok/s	~14x advantage for linear models
64K tokens	~2,200 tok/s	~2,100 tok/s	~45 tok/s	~49x advantage
128K tokens	~2,100 tok/s	~2,050 tok/s	~15 tok/s	Transformer approaching unusable

At short context, the comparison is against a much larger model (70B vs 7B), so most of the gap comes from parameter count. At long context, the gap widens well beyond what the size difference explains, because the transformer's KV cache overhead at 128K context is severe. If you are running Llama 3.3 70B for long-document tasks specifically, the GPU cost per token is roughly 140x higher than a linear-attention 7B at 128K context (~2,100 vs ~15 tok/s in Table 1).

Table 2: Cost per million tokens at 32K context average, H100 SXM5

Model	GPU	Price/hr	Throughput (32K avg)	Cost/M tokens
xLSTM 7B BF16	H100 SXM5 (on-demand)	$3.10/hr	~2,250 tok/s	~$0.38/M
RWKV-7 7B BF16	H100 SXM5 (on-demand)	$3.10/hr	~2,150 tok/s	~$0.40/M
xLSTM 7B BF16	L40S PCIe (on-demand)	$0.72/hr	~1,350 tok/s	~$0.15/M
RWKV-7 7B BF16	A100 SXM4 (on-demand)	$1.70/hr	~1,800 tok/s	~$0.26/M
Llama 3.3 70B BF16	H100 SXM5 (on-demand)	$3.10/hr	~190 tok/s	~$4.53/M

Cost formula: (price_per_hour / 3600) / (throughput / 1_000_000)
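The cost formula is simple enough to script, which helps when prices or measured throughput change. The sketch below reproduces the Table 2 rows from their price and throughput columns; the function name is illustrative and the inputs are the estimates from the table, not new measurements.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollars per one million generated tokens at a sustained throughput."""
    return (price_per_hour / 3600) / (tokens_per_second / 1_000_000)


if __name__ == "__main__":
    rows = [
        ("xLSTM 7B, H100", 3.10, 2250),
        ("RWKV-7 7B, H100", 3.10, 2150),
        ("xLSTM 7B, L40S", 0.72, 1350),
        ("RWKV-7 7B, A100", 1.70, 1800),
        ("Llama 3.3 70B, H100", 3.10, 190),
    ]
    for name, price, tps in rows:
        print(f"{name:<22} ${cost_per_million_tokens(price, tps):.2f}/M")
```

Swap in your own benchmarked tokens per second before using the output for capacity planning.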

At 32K context, xLSTM 7B on H100 is roughly 12x cheaper per token than Llama 3.3 70B on the same hardware. At 128K context, the gap is larger. If your workload is long-context summarization or retrieval over long documents, this cost difference matters.

Pricing fluctuates based on GPU availability. The prices above are based on 06 May 2026 and may have changed. Check current GPU pricing for live rates.

For a broader treatment of inference cost optimization, see the AI inference cost economics guide.

When to Choose Linear Attention: Workload Decision Matrix

Criterion	Use xLSTM or RWKV-7	Use Transformer
Typical context length	Over 16K tokens	Under 4K tokens
VRAM budget	Constrained (under 80 GB)	Flexible
Primary workload	Long-doc analysis, summarization	Short-form generation, complex reasoning
Stateful multi-turn serving	Yes (RWKV-7 advantage)	Requires KV cache reconstruction
Fine-tuning needed	Limited (not yet production-ready)	Full ecosystem (LoRA, PEFT, Axolotl)
Serving framework	transformers + xlstm, rwkv.cpp (custom)	vLLM, SGLang, TensorRT-LLM
Ecosystem maturity	2026 release, early adopter stage	Mature, well-tooled
Cold-start latency	xLSTM: seconds (transformers); RWKV-7: seconds	Seconds

The main constraint on linear-attention models today is ecosystem maturity. vLLM, SGLang, and TensorRT-LLM are not available for xLSTM or RWKV-7 at production quality as of May 2026. Custom runtimes work, but your team needs to maintain them. That is a real operational cost to factor in before switching off transformers.

For pure inference at long context on stable documents, the GPU cost savings are significant and the deployment complexity is manageable. For anything requiring fine-tuning, multi-step tool use, or production-grade serving tooling, a transformer is the safer choice today.

Production Gotchas: State Checkpointing, Batching, Multi-Tenant Serving

State Checkpointing

Unlike transformers, linear-attention models carry inference state between requests in stateful serving. This is different from transformer KV caches, which are request-scoped and discarded at the end of each completion.

For RWKV-7: use the rwkv Python package to capture the model's recurrent state after each turn and serialize it with torch.save (see the deployment section above). State files are 50-200 MB depending on model size. On spot instances where preemptions are possible, flush the state file to durable storage before each response to avoid losing conversation context on restart.

For xLSTM: the transformers library exposes the recurrent state through generation outputs. Checkpoint it with torch.save for preemption recovery:

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="cuda"
)

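# Run a forward pass and capture the recurrent state
inputs = tokenizer("Describe the xLSTM architecture:", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, use_cache=True)
# The attribute holding the recurrent state depends on the transformers release:
# Mamba-style models expose cache_params, others past_key_values. Try both.
state = getattr(out, "cache_params", None)
if state is None:
    state = getattr(out, "past_key_values", None)

# Checkpoint state to disk
torch.save(state, "./checkpoint.pt")

# On resume: load state and continue generation (map_location remaps tensors
# to the current device, so this works correctly after spot-instance preemption).
# weights_only=False lets torch unpickle a cache object rather than bare tensors;
# only use it on checkpoint files you wrote yourself.
state = torch.load("./checkpoint.pt", map_location=torch.device("cuda"), weights_only=False)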

Verify the exact state API by reading the model's documentation, as the attribute name may differ between xLSTM transformers releases.

Batching Behavior

Linear-attention models have no paged KV cache pool to share across batch items, which is what transformer continuous-batching systems are built around.

Each batch item maintains an independent recurrent state, so VRAM grows linearly with batch size rather than with batch size times context length. At batch size 16, xLSTM 7B on H100 uses approximately 16 GB for model weights plus 16 independent state copies, which together fit within H100's 80 GB HBM.

The throughput benefit at high batch sizes is significant for linear-attention models: there is no attention computation that grows with sequence length, so per-token throughput stays near-constant regardless of context. At batch size 16 and 64K context, xLSTM and RWKV-7 throughput still scales roughly linearly with batch size. A transformer at the same batch size and context is fighting quadratic attention overhead.
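The batch-size arithmetic can be sketched as follows. The helper names are illustrative, sizes are in MB so the division stays in integers, and the 200 MB per-request state is an assumed value taken from the upper end of the state-file range quoted earlier in this guide, not a measurement.

```python
def batched_vram_mb(weights_mb, state_mb_per_request, batch_size):
    """Weights are shared; each in-flight request adds one fixed-size state."""
    return weights_mb + state_mb_per_request * batch_size


def max_batch_size(gpu_vram_mb, weights_mb, state_mb_per_request, headroom_mb=0):
    """Largest batch whose weights plus states fit, leaving optional headroom."""
    usable = gpu_vram_mb - headroom_mb - weights_mb
    if usable < 0:
        return 0
    return usable // state_mb_per_request


if __name__ == "__main__":
    # Assumed: 16 GB of weights, 200 MB of state per request, 80 GB GPU.
    print(batched_vram_mb(16_000, 200, 16), "MB at batch 16")
    print(max_batch_size(80_000, 16_000, 200, headroom_mb=8_000), "requests fit")
```

Context length does not appear anywhere in the calculation. In practice activations and runtime buffers also scale with batch size, so treat the result as an upper bound and leave headroom.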

Multi-Tenant Serving

Standard transformer serving frameworks (vLLM, SGLang) use continuous batching with a shared KV cache pool. This does not directly apply to linear-attention models because they have per-request state instead of shared KV caches.

For xLSTM: the transformers-based server in this guide handles per-request state through independent forward passes. Each request in the FastAPI handler gets its own state; verify isolation behavior in the documentation before relying on it for strict multi-tenant isolation.

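For RWKV-7: with the rwkv Python package, the recurrent state is an explicit argument to model.forward, so isolation comes down to passing each request its own state object, as in the stateful multi-turn example earlier in this guide. If you serve through ai00_rwkv_server or another wrapper, check how it scopes state before relying on it for multi-tenant isolation.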

To route traffic across both servers from a single endpoint, use LiteLLM proxy:

yaml

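# litellm_config.yaml
# api_base includes /v1 because the OpenAI-style client appends only the
# route name (/completions, /chat/completions) to it.
model_list:
  - model_name: xlstm-7b
    litellm_params:
      # The FastAPI wrapper in this guide implements only /v1/completions, so this
      # entry uses LiteLLM's text-completion provider prefix instead of openai/.
      # Confirm the prefix against the LiteLLM docs for your version.
      model: text-completion-openai/xlstm-7b
      api_base: http://localhost:8001/v1
  - model_name: rwkv-7-world
    litellm_params:
      model: openai/rwkv-7-world
      api_base: http://localhost:8002/v1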

See the AI gateway guide for the full LiteLLM setup including authentication, rate limiting, and cost tracking across multiple model endpoints.

Getting Started with xLSTM and RWKV-7 on Spheron

Summary of the GPU configurations covered in this guide:

Workload	GPU	On-Demand	Spot	Notes
xLSTM 7B, dev/test	L40S PCIe	$0.72/hr	N/A	Good cost for single-user
xLSTM 7B, production serving	H100 SXM5	$3.10/hr	N/A	Best compute throughput
RWKV-7 7B, stateful chat	A100 SXM4	$1.70/hr	N/A	State checkpointing supported
RWKV-7 + xLSTM mixed fleet	H100 SXM5	$3.10/hr	N/A	Run both servers on same instance
Long-context transformer + linear-attention fleet	H200 SXM5	$2.51/hr	$1.19/hr	H200's bandwidth helps mixed workloads

Pricing fluctuates based on GPU availability. The prices above are based on 06 May 2026 and may have changed. Check current GPU pricing for live rates.

Quick start steps:

Provision a Spheron GPU instance at app.spheron.ai. Pick L40S for development, A100 SXM4 for production RWKV-7, or H100 SXM5 for high-throughput xLSTM.
SSH in and verify your GPU with nvidia-smi. Confirm CUDA 12+ is installed.
For xLSTM: pip install xlstm accelerate transformers fastapi uvicorn
For RWKV-7: git clone https://github.com/RWKV/rwkv.cpp && cd rwkv.cpp && mkdir build && cd build && cmake .. -DGGML_CUDA=ON && make -j$(nproc)
Download model weights from HuggingFace, verify against model card checksums, then launch the inference server.

Check docs.spheron.ai for deployment templates and instance configuration guides.

xLSTM and RWKV-7 change which GPU tier makes sense for long-context workloads. Spheron's bare-metal H100 and H200 instances give you the recurrent state access and memory control that shared serverless platforms restrict. Spot pricing keeps experimentation costs low before committing to reserved capacity.

Quick Setup Guide

Provision a GPU instance on Spheron
Log into app.spheron.ai. For xLSTM 7B or RWKV-7 7B, select an L40S (48 GB) or A100 SXM4 (80 GB) instance. For higher throughput or larger model variants, select H100 SXM5. Deploy with an Ubuntu 22.04 image and the NVIDIA Docker runtime template. Verify the instance with nvidia-smi before proceeding.
Install the runtime for your chosen model
For xLSTM: pip install xlstm accelerate transformers (uses the HuggingFace transformers library with the xlstm package and Triton kernels). For RWKV-7 with rwkv.cpp: git clone https://github.com/RWKV/rwkv.cpp && cd rwkv.cpp && mkdir build && cd build && cmake .. -DGGML_CUDA=ON && make -j$(nproc). Both require CUDA 12+ and Python 3.10+.
Download model weights from HuggingFace
For xLSTM 7B: huggingface-cli download NX-AI/xLSTM-7b --local-dir ./xlstm-7b. For RWKV-7 World v3: huggingface-cli download BlinkDL/rwkv-7-world --local-dir ./rwkv7-world. Verify the download with sha256sum against the model card checksums. Large models may take 10-20 minutes on first pull.
Launch the inference server
For xLSTM, write a FastAPI wrapper that loads the model via the transformers library (see the deployment section of this guide for a working example). For RWKV-7, launch ChatRWKV's demo server: set the model path and strategy inside ChatRWKV/API_DEMO_CHAT.py (for example ./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth with strategy cuda bf16), then run python ChatRWKV/API_DEMO_CHAT.py. For a production OpenAI-compatible RWKV-7 endpoint, use the ai00_rwkv_server community project.
Run a baseline benchmark
Benchmark xLSTM token throughput by sending repeated generation requests to your FastAPI server and measuring tokens per second at 2K, 8K, 16K, and 64K context lengths. Use wrk or a simple Python script to send concurrent requests and measure wall-clock time. Compare output tokens per second at each length against a transformer baseline to validate the linear-attention throughput advantage.
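A simple timing harness for the benchmark step might look like this. It is a sketch that sends prompts one after another rather than concurrently; generate_fn is a hypothetical callable you supply (for example, one that posts a prompt to your server and returns the number of output tokens), and the clock parameter exists only to make the function testable.

```python
import time


def measure_throughput(generate_fn, prompts, clock=time.perf_counter):
    """Return output tokens per second over a list of prompts.

    generate_fn(prompt) must return the number of tokens it generated.
    """
    start = clock()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = clock() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in generator that pretends each request yields 100 tokens in 50 ms.
    def fake_generate(prompt):
        time.sleep(0.05)
        return 100

    print(f"{measure_throughput(fake_generate, ['a', 'b', 'c']):.0f} tok/s")
```

Run it once per context length (2K, 8K, 16K, 64K) with prompts padded to that length, and discard the first run so model warm-up and kernel compilation do not skew the numbers.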

Frequently Asked Questions

How much VRAM does xLSTM 7B require for inference?

xLSTM 7B at BF16 requires approximately 15 GB for model weights plus a fixed recurrent state in the low single-digit GB range, for roughly 15-18 GB in total. This fits comfortably on a single L40S (48 GB) or A100 (80 GB). Because xLSTM's memory state is fixed-size, VRAM does not grow with context length the way a transformer KV cache does: the same GPU that handles 2K-token inference handles arbitrarily long inputs.

How does RWKV-7 differ from xLSTM in inference deployment?

RWKV-7 uses the rwkv.cpp or ChatRWKV runtime rather than the transformers stack. Its recurrent state is smaller and simpler to checkpoint than xLSTM's memory matrix, making RWKV-7 easier to deploy for stateful multi-turn conversations. xLSTM uses the HuggingFace transformers library with the xlstm package; cold starts are similar to other transformers-library models (seconds, not minutes). RWKV-7's rwkv.cpp has near-instant cold starts and can run on CPU as a fallback, though GPU inference is required for throughput above ~50 tokens/second.

What GPU should I use for RWKV-7 World v3 inference?

For RWKV-7 World v3 at 7B parameters and BF16, an L40S (48 GB) or A100 PCIe (80 GB) covers all context lengths without VRAM pressure. For high-concurrency serving (batch size 16+), the H100 SXM5's compute throughput reduces latency meaningfully. RWKV-7 is more memory-bandwidth-bound than compute-bound at short context, so GPUs with high HBM bandwidth (H200, H100 SXM5) outperform GDDR6 GPUs (L40S) at batch sizes above 8.

How does linear-attention throughput compare to transformers at long context?

At short contexts (under 4K tokens), transformers are roughly equivalent or slightly faster than xLSTM and RWKV-7 because their CUDA-optimized attention kernels (FlashAttention-3) are highly tuned. From 16K tokens upward, linear-attention models maintain near-constant throughput while transformer throughput degrades as attention cost grows with context. At 128K tokens, xLSTM and RWKV-7 can deliver 8-15x more tokens per second than a same-size Llama transformer on identical hardware, because there is no KV cache to fill and no attention over the full context window.

Can xLSTM and RWKV-7 run with the same vLLM setup as Llama or Mistral models?

Not currently. xLSTM uses the HuggingFace transformers library with the xlstm package, but it is not supported by vLLM or SGLang as of May 2026. RWKV-7 has limited vLLM integration; the production-ready path uses rwkv.cpp or ChatRWKV. Both can be wrapped with a FastAPI layer to expose an OpenAI-compatible endpoint, so downstream tooling (LangChain, LiteLLM, etc.) works without changes. Check model cards on HuggingFace for the latest framework support status before deploying.