
Deploy xLSTM and RWKV-7 on GPU Cloud: Linear-Attention Alternatives for Million-Token Context Inference (2026)


Most inference guides assume transformers. For long-context workloads, that assumption is becoming expensive.

State space models like Mamba-3 showed that recurrent-state architectures can outperform transformers on million-token context tasks at a fraction of the GPU cost. Two other architectures released in 2026 belong in that same conversation: xLSTM 7B from NXAI and RWKV-7 World v3 from BlinkDL. Both use linear-attention mechanisms, both maintain fixed-size recurrent state that does not grow with context length, and both have production-ready inference runtimes that work today. Our Mamba-3 GPU deployment guide covers the SSM baseline; this post covers the two highest-profile linear-attention additions of 2026. Liquid AI's LFM family is a third non-transformer architecture worth evaluating alongside xLSTM and RWKV-7 - see the Liquid Foundation Model deployment guide for the production setup.

A fourth architecture that belongs in this conversation is TTT (Test-Time Training), which takes the fixed-state idea one step further: the hidden state is itself a small gradient-updated model. See the TTT deployment guide for how TTT-Linear compares to xLSTM and Mamba on long-context GPU inference. A fifth worth tracking is SubQ: unlike xLSTM and RWKV-7, which use fixed-size recurrent state, SubQ retains a growing KV cache but makes it linear rather than quadratic. The SubQ 1M-Preview deployment guide covers how that distinction plays out in VRAM sizing and TTFT at 12M token context.

This guide walks through VRAM sizing, runtime setup with the transformers + xlstm package and rwkv.cpp, throughput benchmarks at multiple context lengths, and a cost-per-million-token comparison against Llama 3.3 70B. For context on the memory bandwidth bottleneck that makes these architectures attractive, see the AI memory wall inference guide.

Why Linear-Attention Architectures Matter in 2026

The transformer KV cache has a fundamental scaling problem. Every token in a sequence keeps its key-value pairs in the cache for as long as the sequence is live, so the VRAM required grows linearly with sequence length and batch size, while the attention compute over that cache grows quadratically.

A 7B transformer at BF16, processing a 128K-token sequence at batch size 4, generates a KV cache of approximately:

kv_cache_gb = 2 x 32 layers x 8 kv_heads x 128 head_dim x 131072 seq_len x 4 batch x 2 bytes / 1e9
            ≈ 69 GB

That is ~69 GB of KV cache on top of the ~15 GB of model weights, totalling ~84 GB. A single H100 SXM5 with 80 GB of HBM cannot fit this. You either need KV cache eviction, NVMe offloading, or a fundamentally different architecture. For the transformer-side mitigations, see the KV cache optimization guide.
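The arithmetic above can be wrapped in a small helper for other configurations. This is a minimal sketch that assumes the same layout as the formula (32 layers, 8 KV heads, head dimension 128, 2 bytes per value); the function name `kv_cache_gb` is illustrative and not part of any library.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Transformer KV cache size in decimal gigabytes."""
    # The leading 2 accounts for storing both keys and values.
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
    return total_bytes / 1e9


# Same 7B-class configuration as the formula above, at batch size 4.
for seq_len in (2_048, 32_768, 131_072):
    size = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=seq_len, batch=4)
    print(f"{seq_len:>7} tokens: {size:5.1f} GB")
```

Doubling either the sequence length or the batch size doubles the result, which is the linear growth a fixed-size recurrent state avoids.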

Linear-attention architectures take the alternative path. Instead of storing every past token in a cache, they compress the sequence history into a fixed-size recurrent state. For xLSTM, this is a matrix memory cell with exponential gating. For RWKV-7, it is a time-mixing mechanism with linear attention and a token-shift operation. The recurrent state for a 7B model stays at a few GB at most, regardless of whether you have processed 2K tokens or 2M tokens.

The GPU economics flip at long context. A model that needs 6x the compute of a transformer at 2K context may need half the compute at 64K, because the transformer is drowning under its KV cache while the linear-attention model operates at constant memory overhead.

xLSTM vs RWKV-7 vs Mamba-3: Architecture Comparison

All three are linear-attention architectures in the sense that their inference complexity scales linearly with sequence length, not quadratically. Beyond that, they differ significantly in how they implement the recurrent state.

| Property | xLSTM 7B | RWKV-7 World v3 | Mamba-3 |
| --- | --- | --- | --- |
| Architecture type | Extended LSTM with matrix memory | Linear attention with time-mixing | Selective state space (SSM) |
| State size (7B BF16) | low single-digit GB (fixed) | low single-digit GB (fixed) | low single-digit GB (fixed) |
| Context scaling | O(1) VRAM, O(n) compute | O(1) VRAM, O(n) compute | O(1) VRAM, O(n) compute |
| VRAM overhead at 128K context | Fixed (no growth) | Fixed (no growth) | Fixed (no growth) |
| Primary 2026 release | NX-AI/xLSTM-7b | BlinkDL/rwkv-7-world | state-spaces/mamba-3 |
| Primary runtime | transformers + xlstm | rwkv.cpp, ChatRWKV | vLLM 0.5+ |
| vLLM support | Not supported (May 2026) | Limited | Full support |
| Framework | PyTorch (transformers + xlstm) | C++ GGML + Python | PyTorch |

xLSTM: Matrix Memory Cells and Exponential Gating

xLSTM extends the classic LSTM by replacing the scalar cell state with a matrix memory. Each layer maintains a matrix C_t that stores compressed representations of past tokens. On each new token, the model computes an exponential gate that controls how much new information updates the matrix versus how much old information is retained.

The matrix structure gives xLSTM more representational capacity than classic LSTM or GRU variants. The exponential gating (the input gate is the exponential of its pre-activation rather than a sigmoid, paired with a running-max stabilizer state that keeps the exponent bounded) remains stable in BF16 and gives the model finer control over information retention across very long sequences.
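The matrix-memory update can be sketched for a single head in a few lines. This is a minimal pure-Python illustration that follows the mLSTM recurrence as published in the xLSTM paper (matrix memory C, normalizer n, stabilizer m). It omits the output gate and the multi-head projections, the name `mlstm_step` and the toy dimensions are illustrative, and it is not the Triton kernel that the xlstm package actually runs.

```python
import math
import random


def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One stabilized mLSTM step for a single head (simplified sketch).

    C: d_v x d_k matrix memory, n: normalizer (length d_k),
    m: running stabilizer (log scale), i_pre / f_pre: gate pre-activations.
    """
    d_k = len(k)
    k = [x / math.sqrt(d_k) for x in k]
    log_f = -math.log1p(math.exp(-f_pre))   # log(sigmoid(f_pre))
    m_new = max(log_f + m, i_pre)           # keeps both exponents below <= 0
    i_gate = math.exp(i_pre - m_new)        # exponential input gate
    f_gate = math.exp(log_f + m - m_new)    # forget gate, rescaled by m
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d_k)]
             for r in range(len(v))]
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d_k)]
    denom = max(abs(sum(a * b for a, b in zip(n_new, q))), 1.0)
    h = [sum(C_new[r][c] * q[c] for c in range(d_k)) / denom
         for r in range(len(v))]
    return C_new, n_new, m_new, h


random.seed(0)
d_k = d_v = 4
C = [[0.0] * d_k for _ in range(d_v)]
n = [0.0] * d_k
m = 0.0
for _ in range(1000):  # 1000 tokens, same state size throughout
    q, k, v = ([random.gauss(0, 1) for _ in range(d_k)] for _ in range(3))
    C, n, m, h = mlstm_step(C, n, m, q, k, v, random.gauss(0, 1), random.gauss(0, 1))
print(f"state is still {len(C)}x{len(C[0])} after 1000 tokens")
```

The state dimensions never change as tokens arrive, which is the property that keeps VRAM flat with context length.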

The inference runtime is the HuggingFace transformers library combined with the xlstm package, which provides Triton-based GPU kernels. This avoids JAX/XLA compilation overhead and loads at the same speed as any other transformers model.

RWKV-7: Time-Mixing with Linear Attention

RWKV-7 uses a different mechanism called time-mixing. Each layer applies a learned time-decay to the recurrent state, controlling how quickly past information fades. The token-shift mechanism adds a mix of the previous token's representation into the current token's computation, giving the model a two-step temporal context without full attention.
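To make the two ideas in this paragraph concrete, here is a deliberately simplified sketch of a token shift and a decayed recurrent state. It is not the actual RWKV-7 update rule, which is more elaborate; the function names, the per-channel `decay` vector, and the toy sizes are all assumptions chosen for illustration.

```python
def token_shift(x_prev, x_curr, mix):
    """Blend each channel of the current token with the previous token."""
    return [m * c + (1.0 - m) * p for p, c, m in zip(x_prev, x_curr, mix)]


def decayed_state_step(state, k, v, decay):
    """Fade the old state per channel, then add the new key/value outer product."""
    return [[decay[j] * state[i][j] + v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def read_state(state, r):
    """Query the fixed-size state with a receptance-like vector r."""
    return [sum(row[j] * r[j] for j in range(len(r))) for row in state]


# Write one token into the state, then let it decay for three empty steps.
state = [[0.0, 0.0], [0.0, 0.0]]
decay = [0.5, 0.5]
state = decayed_state_step(state, k=[1.0, 0.0], v=[1.0, 1.0], decay=decay)
for _ in range(3):
    state = decayed_state_step(state, k=[0.0, 0.0], v=[0.0, 0.0], decay=decay)
print(state[0][0])                   # contribution after three decay steps
print(read_state(state, [1.0, 1.0]))
print(token_shift([0.0, 0.0], [2.0, 4.0], [0.5, 0.25]))
```

In a trained model the decay and mix values are learned (and in recent RWKV versions depend on the input), but the state stays the same size no matter how many tokens pass through it.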

RWKV-7 World v3 is the production-ready variant, trained on a multilingual corpus with strong instruction following. The runtime is rwkv.cpp (for quantized inference) or ChatRWKV (for the Python serving stack). For quantized inference, rwkv.cpp is simpler to deploy than the transformers approach because it uses GGML kernels with tight memory control.

The key production advantage of RWKV-7 is stateful multi-turn serving: the model's recurrent state after processing a user's message can be saved and resumed on the next turn. A transformer serving system with KV cache does something similar, but transformer KV caches are proportional to context length. RWKV-7's state is fixed-size regardless of how many turns have passed.

For the broader Mamba-3 architecture and SSM background, see the Mamba-3 deployment guide.

Hardware Sizing: VRAM, Recurrent State, and GPU Selection

The VRAM formula for linear-attention models follows the same pattern as Mamba-3 but with different state overhead:

vram_gb = (params_billions x bytes_per_dtype x 1.07) + state_gb

For xLSTM 7B at BF16: (7 x 2 x 1.07) + state_gb ≈ 15-18 GB (state is low single-digit GB)

For RWKV-7 7B at BF16: (7 x 2 x 1.07) + state_gb ≈ 15-17 GB (state is low single-digit GB)

The 1.07 overhead factor covers activations and runtime buffers. Linear-attention models use a smaller buffer overhead than transformers (which use ~1.15) because there is no KV cache to reserve headroom for. State size is constant regardless of sequence length.
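The sizing formula translates directly into a helper. This is a minimal sketch: the function name is illustrative, and the 2 GB state value in the example is an assumed placeholder within the low single-digit range quoted above, not a measured figure.

```python
def linear_attention_vram_gb(params_billions, bytes_per_param, state_gb, overhead=1.07):
    """Estimate VRAM as weights times an overhead factor, plus a fixed state."""
    return params_billions * bytes_per_param * overhead + state_gb


# 7B model with an assumed 2 GB recurrent-state budget.
for label, bytes_per_param in (("BF16", 2), ("FP8/INT8", 1)):
    total = linear_attention_vram_gb(7, bytes_per_param, state_gb=2.0)
    print(f"{label:>8}: {total:.1f} GB")
```

Sequence length does not appear anywhere in the function, which is the point: the estimate is the same for a 2K prompt and a 2M-token document.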

For comparison, a transformer 7B at BF16, processing a 32K context at batch size 4, needs approximately 16 GB weights plus 17 GB KV cache, for ~33 GB total. For more on GPU VRAM math for standard LLMs, see the GPU memory requirements guide.

| Model | Params | Precision | VRAM (Weights + State) | Minimum GPU | Context Limit |
| --- | --- | --- | --- | --- | --- |
| xLSTM 7B | 7B | BF16 | ~18 GB | L40S 48 GB | Unlimited (fixed state) |
| xLSTM 7B | 7B | FP8 | ~10 GB | L40S 48 GB | Unlimited (fixed state) |
| RWKV-7 7B | 7B | BF16 | ~17 GB | L40S 48 GB | Unlimited (fixed state) |
| RWKV-7 7B | 7B | INT8 | ~9 GB | L40S 48 GB | Unlimited (fixed state) |
| Llama 3.1 8B | 8B | BF16 | ~16 GB + KV cache | L40S 48 GB | ~32K before pressure |

The "Unlimited" context limit for linear-attention models is literal: VRAM does not grow with sequence length. The same L40S that handles a 2K conversation handles a 128K document analysis without any configuration change. Compare that to the transformer 7B, which needs KV cache management strategies above 32K on a 48 GB GPU.

GPU tier recommendations:

| Use Case | Recommended GPU | On-Demand Price | Spot Price |
| --- | --- | --- | --- |
| xLSTM 7B or RWKV-7 7B, dev/test | L40S PCIe | $0.72/hr | N/A |
| xLSTM 7B or RWKV-7 7B, production | A100 SXM4 80 GB | $1.70/hr | N/A |
| High-throughput serving, batch 8+ | H100 SXM5 | $3.10/hr | N/A |
| Mixed transformer + linear-attention fleet | H200 | $2.51/hr | $1.19/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 06 May 2026 and may have changed. Check current GPU pricing for live rates.

The H100 vs H200 tradeoff for these workloads mirrors the Mamba-3 case. Both xLSTM and RWKV-7 are more compute-bound than memory-bandwidth-bound at long context, because the state update is a matrix multiplication over a fixed-size buffer rather than a streaming read of a large KV cache. H200's 4.8 TB/s bandwidth premium over H100's 3.35 TB/s matters less here. For pure linear-attention serving, H100 gives better price-to-compute than H200. H200 becomes the right pick if you are running a mixed fleet that also includes long-context transformer workloads, where H200's bandwidth advantage directly translates to throughput gains.

Deploying xLSTM 7B with Transformers on Spheron GPU Cloud

Prerequisites

You need:

  • A Spheron GPU instance (provision at app.spheron.ai)
  • Ubuntu 22.04 with NVIDIA drivers 535+
  • CUDA 12.x
  • Python 3.10+

Install

The recommended inference path for xLSTM 7B is the HuggingFace transformers library combined with the xlstm package, which provides the Triton kernels used during inference.

bash
pip install xlstm accelerate transformers

Verify the GPU is visible to PyTorch:

bash
python -c "import torch; print(torch.cuda.get_device_name(0))"

Download Model Weights

Important: Verify the current HuggingFace repository path before pulling. The NXAI organization on HuggingFace hosts the official xLSTM checkpoints. The repository ID is case-sensitive: use NX-AI/xLSTM-7b (lowercase b).

bash
# Repo ID is case-sensitive: NX-AI/xLSTM-7b (lowercase b)
huggingface-cli download NX-AI/xLSTM-7b \
  --local-dir ./xlstm-7b

Verify the download:

bash
ls -la ./xlstm-7b/
sha256sum ./xlstm-7b/*.safetensors  # compare against model card checksums

Single-GPU Inference

Load the model via the transformers AutoModelForCausalLM interface:

python
# infer_xlstm.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "Summarize the key differences between xLSTM and standard transformers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Run it with:

bash
python infer_xlstm.py

OpenAI-Compatible Serving

xLSTM does not have a built-in server binary. For production serving with an OpenAI-compatible endpoint, wrap the transformers model in a FastAPI app:

python
# serve_xlstm.py
import torch
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/v1/completions")
def complete(req: CompletionRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=req.max_tokens)
    text = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"choices": [{"text": text}]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)

Install dependencies and launch:

bash
pip install fastapi uvicorn
python serve_xlstm.py

Multi-GPU with Device Map

For 2-GPU deployments, set device_map="auto" and let accelerate split the model layers across GPUs:

python
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across all visible GPUs
)

The weight split covers model parameters only; the recurrent state is per-request and small regardless of how many GPUs are in use.

Long-Context Advantage

Unlike transformer deployments where long context forces large KV cache allocations (~69+ GB at 128K context, batch 4), xLSTM's memory budget stays fixed regardless of sequence length. The only VRAM consumers are the model weights and a bounded recurrent state buffer. You can process a 2K prompt and a 128K document with the same GPU allocation.

Test Request

bash
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "xlstm-7b", "prompt": "Summarize the key differences between xLSTM and standard transformers:", "max_tokens": 300}'

Docker Variant

Package the serve_xlstm.py script above into a container:

dockerfile
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
RUN pip install xlstm accelerate transformers fastapi uvicorn
COPY serve_xlstm.py /app/serve_xlstm.py
WORKDIR /app
CMD ["python", "serve_xlstm.py"]

Build the image and run it with the downloaded weights mounted into the container:

bash
docker build -t xlstm-server .
docker run --gpus all --ipc=host --rm \
  -v "$(pwd)/xlstm-7b:/app/xlstm-7b" \
  -p 8001:8001 \
  xlstm-server

Deploying RWKV-7 World v3 with rwkv.cpp and ChatRWKV

Build rwkv.cpp from Source

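bash
# --recursive also fetches the ggml submodule that the build depends on
git clone --recursive https://github.com/RWKV/rwkv.cpp
cd rwkv.cpp
mkdir build && cd build

# Build with CUDA support (confirm the CUDA option name in the rwkv.cpp
# README for the commit you check out)
cmake .. -DGGML_CUDA=ON
make -j$(nproc)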

This produces ./bin/rwkv and the Python shared library. CUDA 12+ and CMake 3.17+ are required. The build takes 3-8 minutes on a fresh instance.

Download Model Weights

bash
# Verify current repo path at https://huggingface.co/BlinkDL/rwkv-7-world
huggingface-cli download BlinkDL/rwkv-7-world \
  --local-dir ./rwkv7-world

Check the model card for the exact filename of the 7B BF16 checkpoint. Filenames in RWKV releases follow a pattern like RWKV-7-World-v3-7B-bf16.pth, but verify before using.

Quantize for Dev/Test

BF16 native weights give the best quality. For memory-constrained dev environments, quantize to INT8:

bash
# Only use INT8 for dev/test. Use BF16 for production.
./bin/rwkv quantize \
  ./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth \
  ./rwkv7-world-q8.bin \
  q8_0

Note: The q8_0 format is specific to the rwkv.cpp quantizer and may not load correctly in ChatRWKV if the build versions differ. If you see loading errors with the quantized file, fall back to the native BF16 weights. Do not use INT8 quantization for production serving; the quality tradeoff and potential incompatibilities are not worth the VRAM savings on a 48 GB L40S where the 17 GB model fits comfortably.

Launch the ChatRWKV Demo Server

ChatRWKV's main files are demo scripts (API_DEMO.py, API_DEMO_CHAT.py, API_DEMO_WORLD.py, chat.py). For a quick test, use API_DEMO_CHAT.py:

bash
# Install ChatRWKV
git clone https://github.com/BlinkDL/ChatRWKV
cd ChatRWKV
pip install -r requirements.txt

# Run the chat demo (edit the model path inside the script)
python API_DEMO_CHAT.py

For a production OpenAI-compatible RWKV-7 endpoint, use the community ai00_rwkv_server project, which wraps RWKV inference in a proper HTTP server with /v1/chat/completions support.

Stateful Multi-Turn Conversations

This is RWKV-7's main production advantage over stateless transformer serving. The model's recurrent state after each user turn is a small tensor that you can save and resume using the rwkv Python package directly:

python
import os

# The ChatRWKV demo scripts set this flag before importing rwkv when loading
# RWKV-7 checkpoints; confirm it against the rwkv package version you install.
os.environ["RWKV_V7_ON"] = "1"

import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model="./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth", strategy="cuda bf16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# First turn: state starts as None
tokens = pipeline.encode("Hello, how are you?")
out, state = model.forward(tokens, state=None)

# Save state to disk after the turn
os.makedirs("./sessions", exist_ok=True)
torch.save(state, "./sessions/user_123.bin")

# Resume from saved state on next request (map_location remaps tensors to the
# current device, so this works correctly after spot-instance preemption)
state = torch.load("./sessions/user_123.bin", map_location=torch.device("cuda"))
tokens = pipeline.encode("What did I just ask you?")
out, state = model.forward(tokens, state=state)

The state is a list of CUDA tensors, typically 50-200 MB depending on model size. For multi-user deployments, maintain a session store mapping user IDs to state file paths. This gives RWKV-7 a genuine stateful-chat advantage: the model retains context across sessions without any KV cache reconstruction overhead.
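A session store along those lines might look like the sketch below. The class name `SessionStateStore`, its methods, and the pluggable `save_fn` / `load_fn` hooks are hypothetical, not part of the rwkv package. The example defaults to `pickle` so that it runs without a GPU; in a real deployment you would pass `torch.save` and a `torch.load` wrapper instead.

```python
import os
import pickle
import re
import tempfile


def _pickle_save(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def _pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


class SessionStateStore:
    """Maps session IDs to state files, writing each file atomically."""

    def __init__(self, root, save_fn=_pickle_save, load_fn=_pickle_load):
        self.root = root
        self._save = save_fn
        self._load = load_fn
        os.makedirs(root, exist_ok=True)

    def _path(self, session_id):
        # Reject anything that could escape the session directory.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", session_id):
            raise ValueError(f"invalid session id: {session_id!r}")
        return os.path.join(self.root, f"{session_id}.bin")

    def save(self, session_id, state):
        final_path = self._path(session_id)
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        os.close(fd)
        try:
            self._save(state, tmp_path)
            # os.replace is atomic, so a crash never leaves a half-written file.
            os.replace(tmp_path, final_path)
        finally:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)

    def load(self, session_id):
        path = self._path(session_id)
        # None means "new conversation", matching state=None on the first turn.
        return self._load(path) if os.path.exists(path) else None


with tempfile.TemporaryDirectory() as root:
    store = SessionStateStore(root)
    store.save("user_123", [[0.1, 0.2], [0.3, 0.4]])  # stand-in for a tensor list
    print(store.load("user_123"))
    print(store.load("unknown_user"))
```

With PyTorch available, the same class could be constructed as `SessionStateStore("./sessions", save_fn=torch.save, load_fn=lambda p: torch.load(p, map_location="cuda"))`. Writing to a temporary file and renaming it means a preemption in the middle of a write leaves the previous state intact.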

Note on rwkv7-g1: The BlinkDL/rwkv-7-world model card now also points to BlinkDL/rwkv7-g1 as the recommended upgrade ("fully compatible and better in all aspects"). If you are starting a new deployment, check both repos and prefer rwkv7-g1 if it fits your context length and quantization needs.

Test Request

bash
curl http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "rwkv-7-world", "messages": [{"role": "user", "content": "What are the architectural differences between RWKV and a standard transformer?"}], "max_tokens": 300}'

Throughput, TTFT, and Cost Per Million Tokens vs Llama 3.3 70B

These figures are estimates based on published linear-attention scaling behavior (xLSTM 7B paper, arXiv:2503.13427) and vLLM baseline measurements. Run your own benchmarks on your actual hardware before making capacity decisions.

A note on model size: xLSTM 7B and RWKV-7 7B are compared against Llama 3.3 70B in the cost table below. These are not equivalent parameter counts. The comparison is cost-tier: it shows the cost-per-token tradeoff at the H100 SXM5 GPU tier, with Llama 3.3 70B as the transformer baseline because it represents a common production choice on H100. Keep in mind that Llama 3.3 70B in BF16 needs roughly 140 GB for weights alone, so it does not fit on a single 80 GB H100; the Llama figures are best read as per-GPU estimates for a multi-GPU deployment.

Table 1: Throughput at varying context lengths (tokens/sec, single H100 SXM5)

| Context Length | xLSTM 7B | RWKV-7 7B | Llama 3.3 70B | Note |
| --- | --- | --- | --- | --- |
| 2K tokens | ~2,400 tok/s | ~2,300 tok/s | ~600 tok/s | A same-size (7B) transformer would be faster here |
| 8K tokens | ~2,350 tok/s | ~2,250 tok/s | ~380 tok/s | Linear-attention advantage growing |
| 16K tokens | ~2,300 tok/s | ~2,200 tok/s | ~160 tok/s | ~14x advantage for linear models |
| 64K tokens | ~2,200 tok/s | ~2,100 tok/s | ~45 tok/s | ~49x advantage |
| 128K tokens | ~2,100 tok/s | ~2,050 tok/s | ~15 tok/s | Transformer approaching unusable |

At short context, the comparison is against a much larger model (70B vs 7B), so part of the gap is simply model size. At long context, the gap widens well beyond that because the transformer's KV cache overhead at 128K context is severe. If you are running Llama 3.3 70B for long-document tasks specifically, Table 1 puts its GPU cost per token at roughly 140x that of a linear-attention 7B at 128K context (~2,100 vs ~15 tok/s on the same hourly price).

Table 2: Cost per million tokens at 32K context average, H100 SXM5

| Model | GPU | Price/hr | Throughput (32K avg) | Cost/M tokens |
| --- | --- | --- | --- | --- |
| xLSTM 7B BF16 | H100 SXM5 (on-demand) | $3.10/hr | ~2,250 tok/s | ~$0.38/M |
| RWKV-7 7B BF16 | H100 SXM5 (on-demand) | $3.10/hr | ~2,150 tok/s | ~$0.40/M |
| xLSTM 7B BF16 | L40S PCIe (on-demand) | $0.72/hr | ~1,350 tok/s | ~$0.15/M |
| RWKV-7 7B BF16 | A100 SXM4 (on-demand) | $1.70/hr | ~1,800 tok/s | ~$0.26/M |
| Llama 3.3 70B BF16 | H100 SXM5 (on-demand) | $3.10/hr | ~190 tok/s | ~$4.53/M |

Cost formula: (price_per_hour / 3600) / (throughput / 1_000_000)
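The same formula as a function, applied to the rows of Table 2 so you can substitute your own measured throughput and current prices. The function name is illustrative; the prices and throughputs are the estimates from the table, not new measurements.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollar cost to generate one million tokens at a sustained throughput."""
    return (price_per_hour / 3600) / (tokens_per_second / 1_000_000)


rows = [
    ("xLSTM 7B, H100 SXM5", 3.10, 2250),
    ("RWKV-7 7B, H100 SXM5", 3.10, 2150),
    ("xLSTM 7B, L40S PCIe", 0.72, 1350),
    ("RWKV-7 7B, A100 SXM4", 1.70, 1800),
    ("Llama 3.3 70B, H100 SXM5", 3.10, 190),
]
for name, price, tps in rows:
    print(f"{name:<26} ${cost_per_million_tokens(price, tps):.2f}/M")
```

Because cost is inversely proportional to throughput, any benchmark result that differs from the estimates shifts the cost column by the same ratio.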

At 32K context, xLSTM 7B on H100 is roughly 12x cheaper per token than Llama 3.3 70B on the same hardware. At 128K context, the gap is larger. If your workload is long-context summarization or retrieval over long documents, this cost difference matters.


For a broader treatment of inference cost optimization, see the AI inference cost economics guide.

When to Choose Linear Attention: Workload Decision Matrix

| Criterion | Use xLSTM or RWKV-7 | Use Transformer |
| --- | --- | --- |
| Typical context length | Over 16K tokens | Under 4K tokens |
| VRAM budget | Constrained (under 80 GB) | Flexible |
| Primary workload | Long-doc analysis, summarization | Short-form generation, complex reasoning |
| Stateful multi-turn serving | Yes (RWKV-7 advantage) | Requires KV cache reconstruction |
| Fine-tuning needed | Limited (not yet production-ready) | Full ecosystem (LoRA, PEFT, Axolotl) |
| Serving framework | transformers + xlstm, rwkv.cpp (custom) | vLLM, SGLang, TensorRT-LLM |
| Ecosystem maturity | 2026 release, early adopter stage | Mature, well-tooled |
| Cold-start latency | xLSTM: seconds (transformers); RWKV-7: seconds | Seconds |

The main constraint on linear-attention models today is ecosystem maturity. vLLM, SGLang, and TensorRT-LLM are not available for xLSTM or RWKV-7 at production quality as of May 2026. Custom runtimes work, but your team needs to maintain them. That is a real operational cost to factor in before switching off transformers.

For pure inference at long context on stable documents, the GPU cost savings are significant and the deployment complexity is manageable. For anything requiring fine-tuning, multi-step tool use, or production-grade serving tooling, a transformer is the safer choice today.

Production Gotchas: State Checkpointing, Batching, Multi-Tenant Serving

State Checkpointing

Unlike transformers, linear-attention models carry inference state between requests in stateful serving. This is different from transformer KV caches, which are request-scoped and discarded at the end of each completion.

For RWKV-7: use the rwkv Python package to capture the model's recurrent state after each turn and serialize it with torch.save (see the deployment section above). State files are 50-200 MB depending on model size. On spot instances where preemptions are possible, flush the state file to durable storage before each response to avoid losing conversation context on restart.

For xLSTM: the transformers library exposes the recurrent state through generation outputs. Checkpoint it with torch.save for preemption recovery:

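python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./xlstm-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Run a forward pass over the prompt and capture the recurrent state
inputs = tokenizer("Describe the xLSTM architecture:", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, use_cache=True)

# The attribute name depends on the transformers release. Mamba-style
# recurrent models expose `cache_params`; cache-style models expose
# `past_key_values`. Try both rather than assuming one.
state = getattr(out, "cache_params", None)
if state is None:
    state = getattr(out, "past_key_values", None)
if state is None:
    raise RuntimeError("No recurrent state on the model output; check the model docs")

# Checkpoint state to disk
torch.save(state, "./checkpoint.pt")

# On resume: load the state (map_location remaps tensors to the current device,
# so this works correctly after spot-instance preemption). If the state is a
# cache object rather than plain tensors, recent PyTorch releases may need
# weights_only=False here; only use that on files you wrote yourself.
state = torch.load("./checkpoint.pt", map_location=torch.device("cuda"))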

Verify the exact state API by reading the model's documentation, as the attribute name may differ between xLSTM transformers releases.

Batching Behavior

Linear-attention models do not draw on a shared cache pool the way transformer serving stacks allocate KV cache blocks in continuous batching systems.

Each batch item maintains an independent recurrent state. This means batch size scales VRAM linearly, not with any shared overhead. At batch size 16, xLSTM 7B on H100 uses approximately 16 GB for model weights plus 16 independent state copies. Both fit within H100's 80 GB HBM at typical batch sizes.
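A quick way to reason about this is to budget VRAM per request. The sketch below is illustrative only: it takes the ~16 GB weight figure from the paragraph above and the upper end of the 50-200 MB per-session state range quoted earlier in this guide, and `reserve_gb` is a hypothetical safety margin for activations and runtime buffers rather than a measured value.

```python
def batch_vram_gb(weights_gb, state_mb_per_request, batch_size):
    """Weights are shared; each request adds one fixed-size state."""
    return weights_gb + batch_size * state_mb_per_request / 1000


def max_batch_size(gpu_gb, weights_gb, state_mb_per_request, reserve_gb=4.0):
    """Largest batch whose states fit after weights and a safety reserve."""
    free_mb = (gpu_gb - weights_gb - reserve_gb) * 1000
    if free_mb <= 0:
        return 0
    return int(free_mb // state_mb_per_request)


print(f"batch 16: {batch_vram_gb(16, 200, 16):.1f} GB")
print(f"max batch on 80 GB: {max_batch_size(80, 16, 200)}")
```

Under these assumptions memory is not the limiting factor at typical batch sizes; compute throughput becomes the ceiling well before the states fill the GPU.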

The throughput benefit at high batch sizes is significant for linear-attention models: there is no attention computation that grows with sequence length, so the GPU stays at near-constant throughput per token regardless of context. At batch size 16 and 64K context, xLSTM and RWKV-7 deliver roughly linear throughput scaling. A transformer at the same batch size and context is fighting quadratic attention overhead.

Multi-Tenant Serving

Standard transformer serving frameworks (vLLM, SGLang) use continuous batching with a shared KV cache pool. This does not directly apply to linear-attention models because they have per-request state instead of shared KV caches.

For xLSTM: the transformers-based server in this guide handles per-request state through independent forward passes. Each request in the FastAPI handler gets its own state; verify isolation behavior in the documentation before relying on it for strict multi-tenant isolation.

For RWKV-7: ChatRWKV's API server manages per-request state by default. Each request gets an isolated state context.

To route traffic across both servers from a single endpoint, use LiteLLM proxy:

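yaml
# litellm_config.yaml
# api_base is the prefix that comes before /completions or /chat/completions;
# the servers in this guide serve under /v1. Verify the provider prefixes
# against the LiteLLM docs for your version.
model_list:
  - model_name: xlstm-7b
    litellm_params:
      # The FastAPI wrapper in this guide only implements /v1/completions,
      # so route it through the text-completion provider prefix.
      model: text-completion-openai/xlstm-7b
      api_base: http://localhost:8001/v1
  - model_name: rwkv-7-world
    litellm_params:
      model: openai/rwkv-7-world
      api_base: http://localhost:8002/v1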

See the AI gateway guide for the full LiteLLM setup including authentication, rate limiting, and cost tracking across multiple model endpoints.

Getting Started with xLSTM and RWKV-7 on Spheron

Summary of the GPU configurations covered in this guide:

| Workload | GPU | On-Demand | Spot | Notes |
| --- | --- | --- | --- | --- |
| xLSTM 7B, dev/test | L40S PCIe | $0.72/hr | N/A | Good cost for single-user |
| xLSTM 7B, production serving | H100 SXM5 | $3.10/hr | N/A | Best compute throughput |
| RWKV-7 7B, stateful chat | A100 SXM4 | $1.70/hr | N/A | State checkpointing supported |
| RWKV-7 + xLSTM mixed fleet | H100 SXM5 | $3.10/hr | N/A | Run both servers on same instance |
| Long-context transformer + linear-attention fleet | H200 SXM5 | $2.51/hr | $1.19/hr | H200's bandwidth helps mixed workloads |


Quick start steps:

  1. Provision a Spheron GPU instance at app.spheron.ai. Pick L40S for development, A100 SXM4 for production RWKV-7, or H100 SXM5 for high-throughput xLSTM.
  2. SSH in and verify your GPU with nvidia-smi. Confirm CUDA 12+ is installed.
  3. For xLSTM: pip install xlstm accelerate transformers fastapi uvicorn
  4. For RWKV-7: git clone --recursive https://github.com/RWKV/rwkv.cpp && cd rwkv.cpp && mkdir build && cd build && cmake .. -DGGML_CUDA=ON && make -j$(nproc)
  5. Download model weights from HuggingFace, verify against model card checksums, then launch the inference server.

Check docs.spheron.ai for deployment templates and instance configuration guides.


xLSTM and RWKV-7 change which GPU tier makes sense for long-context workloads. Spheron's bare-metal H100 and H200 instances give you the recurrent state access and memory control that shared serverless platforms restrict. Spot pricing keeps experimentation costs low before committing to reserved capacity.



Quick Setup Guide

  1. Provision a GPU instance on Spheron

    Log into app.spheron.ai. For xLSTM 7B or RWKV-7 7B, select an L40S (48 GB) or A100 SXM4 (80 GB) instance. For higher throughput or larger model variants, select H100 SXM5. Deploy with an Ubuntu 22.04 image and the NVIDIA Docker runtime template. Verify the instance with nvidia-smi before proceeding.

  2. Install the runtime for your chosen model

    For xLSTM: pip install xlstm accelerate transformers (uses the HuggingFace transformers library with the xlstm package and Triton kernels). For RWKV-7 with rwkv.cpp: git clone --recursive https://github.com/RWKV/rwkv.cpp && cd rwkv.cpp && mkdir build && cd build && cmake .. -DGGML_CUDA=ON && make -j$(nproc). Both require CUDA 12+ and Python 3.10+.

  3. Download model weights from HuggingFace

    For xLSTM 7B: huggingface-cli download NX-AI/xLSTM-7b --local-dir ./xlstm-7b. For RWKV-7 World v3: huggingface-cli download BlinkDL/rwkv-7-world --local-dir ./rwkv7-world. Verify the download with sha256sum against the model card checksums. Large models may take 10-20 minutes on first pull.

  4. Launch the inference server

    For xLSTM, write a FastAPI wrapper that loads the model via the transformers library (see the deployment section of this guide for a working example). For RWKV-7, edit the model path and strategy inside ChatRWKV's demo script so they point at ./rwkv7-world/RWKV-7-World-v3-7B-bf16.pth with the cuda bf16 strategy, then run python ChatRWKV/API_DEMO_CHAT.py. For a production OpenAI-compatible RWKV-7 endpoint, use the ai00_rwkv_server community project.

  5. Run a baseline benchmark

    Benchmark xLSTM token throughput by sending repeated generation requests to your FastAPI server and measuring tokens per second at 2K, 8K, 16K, and 64K context lengths. Use wrk or a simple Python script to send concurrent requests and measure wall-clock time. Compare output tokens per second at each length against a transformer baseline to validate the linear-attention throughput advantage.
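A simple Python harness for that benchmark could look like the sketch below. It assumes the `/v1/completions` request shape of the FastAPI wrapper from this guide; the helper names are illustrative. Two approximations are worth noting: `make_prompt` counts words rather than tokens, and `measure` divides the requested `max_tokens` by wall-clock time, so the result overstates throughput if the model stops early.

```python
import json
import time
import urllib.request


def make_prompt(n_words):
    """Build a filler prompt; word count only approximates token count."""
    return " ".join(["lorem"] * n_words)


def post_completion(url, prompt, max_tokens, timeout=600):
    """Send one completion request and return the decoded JSON response."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def measure(send, prompt, max_tokens, runs=3):
    """Average requested output tokens per second over `runs` sequential calls."""
    elapsed = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        send(prompt, max_tokens)
        elapsed += time.perf_counter() - start
    return runs * max_tokens / elapsed


# Example usage (requires the server from this guide to be running):
#   send = lambda p, n: post_completion("http://localhost:8001/v1/completions", p, n)
#   for ctx in (2_000, 8_000, 16_000, 64_000):
#       print(ctx, round(measure(send, make_prompt(ctx), 128), 1), "tok/s")
```

Passing the request function in as `send` keeps the timing logic independent of the transport, so the same loop can be pointed at a transformer baseline for the comparison.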


Frequently Asked Questions

How much VRAM does xLSTM 7B need?

xLSTM 7B at BF16 requires approximately 15-17 GB for model weights plus a fixed recurrent memory state of 1-3 GB, totaling roughly 16-20 GB. This fits comfortably on a single L40S (48 GB) or A100 (80 GB). Because xLSTM's memory state is fixed-size, VRAM does not grow with context length the way a transformer KV cache does. At BF16 you can run unlimited context on the same GPU that handles 2K-token inference.

How does deploying RWKV-7 differ from deploying xLSTM?

RWKV-7 uses the rwkv.cpp or ChatRWKV runtime rather than the HuggingFace transformers stack. Its recurrent state is smaller and simpler to checkpoint than xLSTM's memory matrix, making RWKV-7 easier to deploy for stateful multi-turn conversations. xLSTM uses the HuggingFace transformers library with the xlstm package; cold starts are similar to other models loaded through transformers (seconds, not minutes). RWKV-7's rwkv.cpp has near-instant cold starts and can run on CPU as a fallback, though GPU inference is required for throughput above ~50 tokens/second.

Which GPU should I use for RWKV-7 World v3?

For RWKV-7 World v3 at 7B parameters and BF16, an L40S (48 GB) or A100 PCIe (80 GB) covers all context lengths without VRAM pressure. For high-concurrency serving (batch size 16+), the H100 SXM5's compute throughput reduces latency meaningfully. RWKV-7 is more memory-bandwidth-bound than compute-bound at short context, so GPUs with high HBM bandwidth (H200, H100 SXM5) outperform GDDR6 GPUs (L40S) at batch sizes above 8.

How does throughput compare with transformers at different context lengths?

At short contexts (under 4K tokens), transformers are roughly equivalent or slightly faster than xLSTM and RWKV-7 because their CUDA-optimized attention kernels (FlashAttention-3) are highly tuned. From 16K tokens upward, linear-attention models maintain near-constant throughput while transformer throughput degrades steadily as attention and KV cache costs grow with context. At 128K tokens, xLSTM and RWKV-7 can deliver 8-15x more tokens per second than a same-size Llama transformer on identical hardware, because there is no KV cache to fill and no attention over the full context window.

Do xLSTM and RWKV-7 work with vLLM or SGLang?

Not currently. xLSTM uses the HuggingFace transformers library with the xlstm package, but it is not supported by vLLM or SGLang as of May 2026. RWKV-7 has limited vLLM integration; the production-ready path uses rwkv.cpp or ChatRWKV. Both can be wrapped with a FastAPI layer to expose an OpenAI-compatible endpoint, so downstream tooling (LangChain, LiteLLM, etc.) works without changes. Check model cards on HuggingFace for the latest framework support status before deploying.
