Test-Time Training on GPU Cloud: Deploy TTT Layers for Adaptive LLM Inference Without Retraining (2026 Guide)

Test-time training adapts a model to the current input during inference, not during a separate training run. This is a different idea from inference-time compute scaling, which spends more GPU on token generation without touching weights. TTT runs gradient updates on a small set of parameters while your request is being processed. The result is a model that has seen your specific input and adjusted itself to handle it better, all before generating the first output token.

The connection to state space models is direct. The Mamba-3 deployment guide covers how SSMs replace the KV cache with a fixed-size recurrent state. TTT takes this idea further: instead of a fixed learned transition rule, the recurrent state is itself a small model that is gradient-updated on your input. That distinction determines which GPU you need and how much latency you should expect.

What Test-Time Training Is (and What It Is Not)

TTT runs a small number of SGD steps on the TTT layer's hidden state using the current input as self-supervised data. The main model weights never change. After the request completes, the hidden state is discarded. The next request starts fresh.

This is distinct from three other techniques that sound similar but work differently:

Inference-time compute scaling generates more tokens (chain-of-thought, best-of-N) to get better answers. No weights are updated. No gradient is computed. The model is unchanged.

Fine-tuning updates the main model weights on labeled data during a training run, before deployment. It produces a new model artifact that is used for all future requests equally.

LoRA per-user adapters train small adapter matrices on user-specific data, then load those matrices at inference time. The adapter is static once trained. TTT, by contrast, adapts dynamically to each input at inference time without any prior training on that input.

Technique	Weights Updated	When Compute Runs	Per-Request Cost	Produces Adapted Artifact
Standard inference	None	Inference only	Baseline	No
Inference-time compute scaling	None	Inference only	10-50x	No
Test-Time Training	TTT hidden state only	Inference (inner loop)	1.6-3x	No (state is discarded)
LoRA per-user fine-tuning	LoRA adapter weights	Training	Training cost amortized	Yes (stored adapter)
Continuous pretraining	All weights	Training	Very high	Yes (new checkpoint)

For teams weighing per-user adapter serving as an alternative, see the LoRA multi-adapter serving guide. For cases where the distribution shift is large enough that TTT and LoRA are both insufficient, see the continuous pretraining guide.

The key practical implication: TTT has no upfront training cost per user, no stored adapter to manage, and no per-user infrastructure. But every inference request pays the gradient computation cost.

TTT Layers vs Mamba vs Attention: The Hidden-State-as-Model Intuition

The three architectures each represent context differently:

A transformer's KV cache grows with every token in the sequence. It stores the full history and attends to all of it on each generation step. Memory scales with sequence length.

Mamba and other SSMs compress the past into a fixed-size recurrent state. The state is updated by a learned transition rule on each new token. Memory stays constant regardless of context length. The transition rule is fixed at training time.
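To make the scaling difference concrete, here is a rough sizing sketch. The layer count, KV-head count, head dimension, and per-layer state size are hypothetical values chosen for illustration, not the configuration of any model named in this guide.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: K and V tensors per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def fixed_state_bytes(n_layers, state_elems_per_layer, bytes_per_value=2):
    """Recurrent-state size: independent of sequence length."""
    return n_layers * state_elems_per_layer * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape (assumption, for illustration only).
    layers, kv_heads, head_dim = 24, 16, 128
    state_elems = kv_heads * head_dim * head_dim  # one square matrix per head

    for seq_len in (2_048, 8_192, 32_768):
        kv = kv_cache_bytes(seq_len, layers, kv_heads, head_dim)
        st = fixed_state_bytes(layers, state_elems)
        print(f"{seq_len:>6} tokens: KV cache {kv / 2**20:8.1f} MiB, "
              f"fixed state {st / 2**20:6.1f} MiB")
```

The KV cache grows linearly with the number of tokens, while the fixed-state figure is the same at every context length.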

TTT takes the SSM idea one step further: the hidden state is itself a small model, either a linear model (TTT-Linear) or a small MLP (TTT-MLP). When new input tokens arrive, the layer runs gradient descent on this small inner model using a self-supervised objective. The inner model adapts to the current input distribution rather than just updating according to a fixed rule.

Architecture	Hidden State Type	Update Rule	Context Scaling	Supports OOD Adaptation
Transformer	None (KV cache grows)	Attention over full history	O(n) memory, O(n²) attention compute	No
Mamba/SSM	Fixed matrix	Learned SSM transition	O(1) memory	Limited
TTT-Linear	Small linear model	Inner-loop SGD	O(1) memory	Yes
TTT-MLP	Small MLP	Inner-loop SGD	O(1) memory	Yes (stronger)

The Mamba row is for teams already running SSM inference and evaluating whether the TTT variant makes sense for their workload.

The practical difference: Mamba works well for sequences where the past distribution is stable. TTT works better when the current input comes from a domain the model hasn't seen much of, because the inner model learns a compressed representation of the current input's specific patterns.
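To make the hidden-state-as-model idea concrete, the following is a minimal pure-Python sketch of a TTT-Linear-style update. It is not the ttt-lm-pytorch implementation. It assumes a squared-error reconstruction loss, a zero-initialized state, and a toy `corrupt` function (hypothetical) standing in for the learned input and target projections a real layer would use.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def corrupt(x):
    """Toy self-supervised view: zero out odd-indexed features (assumption)."""
    return [v if i % 2 == 0 else 0.0 for i, v in enumerate(x)]


def sgd_step(W, x_in, x_target, lr):
    """One SGD step on loss = 0.5 * ||W @ x_in - x_target||^2."""
    err = [p - t for p, t in zip(matvec(W, x_in), x_target)]
    loss = 0.5 * sum(e * e for e in err)
    # The gradient with respect to W is the outer product err * x_in^T.
    W_new = [[w - lr * e * xj for w, xj in zip(row, x_in)]
             for row, e in zip(W, err)]
    return W_new, loss


def run_ttt_layer(tokens, inner_steps=1, lr=0.5):
    """Process a token sequence, updating the inner linear model as it goes."""
    dim = len(tokens[0])
    W = [[0.0] * dim for _ in range(dim)]  # fresh state for each request
    outputs, losses = [], []
    for x in tokens:
        x_in = corrupt(x)
        loss_before = None
        for _ in range(inner_steps):
            W, loss = sgd_step(W, x_in, x, lr)
            if loss_before is None:
                loss_before = loss  # loss before this token's updates
        losses.append(loss_before)
        outputs.append(matvec(W, x))  # layer output uses the updated state
    return W, outputs, losses


if __name__ == "__main__":
    sequence = [[1.0, 0.5, -0.5, 0.25]] * 8
    state, _, losses = run_ttt_layer(sequence, inner_steps=2)
    print("state shape:", len(state), "x", len(state[0]))
    print("first loss:", losses[0], "last loss:", losses[-1])
```

The state stays a `dim` by `dim` matrix no matter how many tokens are processed, and the reconstruction loss falls as the inner model fits the repeated input. That falling loss is the adaptation behavior described above.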

The TTT Paper Lineage

The core idea comes from Sun et al., "Learning to (Learn at Test Time)" (2024), which introduced TTT-Linear and TTT-MLP and showed competitive language modeling perplexity against transformer baselines with O(1) context memory.

In 2025, the same group released a JAX implementation for TPU and scaled experiments to 7B parameter models. Perplexity at long contexts (32K+ tokens) showed consistent improvement over same-size transformers.

By 2026, two production-relevant components arrived: a Triton-fused inner-loop kernel (included in ttt-lm-pytorch) that makes the backward pass efficient enough for real inference use, and pretrained checkpoints on Hugging Face under the Test-Time-Training organization (e.g. ttt-linear-1.3b-pile-8k, ttt-mlp-1.3b-pile-8k).

Two reference implementations exist:

ttt-lm-pytorch: PyTorch with a Triton-fused kernel for the inner-loop SGD backward pass. This is the practical starting point for GPU cloud deployment. Easier to modify than the JAX version.
JAX/Flax implementation: Faster on TPU, relevant for teams already on JAX infrastructure. Less directly useful for the GPU cloud deployment case.

Note: pretrained TTT-Linear and TTT-MLP checkpoints are published under the Test-Time-Training organization on Hugging Face (e.g. Test-Time-Training/ttt-linear-1.3b-pile-8k). Model IDs may change between research releases, so verify the current list at huggingface.co/Test-Time-Training before deploying.

Hardware Implications: Inner-Loop SGD at Inference Time

Standard LLM inference on a GPU runs:

Forward pass through each model layer
KV cache update
Token sampling

TTT inference adds three additional steps during the prefill phase for every input chunk:

Forward pass through the TTT layer's small inner model
Backward pass through the TTT layer's small inner model (gradient computation)
SGD update step on the TTT hidden state

The backward pass is the novel cost. It requires storing activations from the TTT layer forward pass during prefill. For TTT-Linear, these activations are a matrix, not a deep graph. The overhead is manageable. For TTT-MLP, activations from a small MLP must be stored, which is larger.

The backward pass also requires compute beyond the forward-only inference budget. A GPU that is 90% utilized running standard inference may not have headroom for TTT at the same batch size. Size for TTT by adding 30-50% to the VRAM and compute budget you would allocate for the same model without TTT.
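The sizing rule above is simple arithmetic, sketched here as a helper. The function names and the example baseline figures are hypothetical; the 30-50% range is the one stated in this section.

```python
def ttt_budget(baseline, low=0.30, high=0.50):
    """Return the (low, high) budget after adding 30-50% TTT headroom."""
    return baseline * (1 + low), baseline * (1 + high)


def fits_with_ttt(baseline_vram_gb, gpu_vram_gb):
    """Conservative check: does the high end of the range fit on the GPU?"""
    _, high = ttt_budget(baseline_vram_gb)
    return high <= gpu_vram_gb


if __name__ == "__main__":
    # Hypothetical baseline: a deployment that uses 20 GB without TTT.
    low, high = ttt_budget(20.0)
    print(f"Plan for {low:.1f}-{high:.1f} GB with TTT enabled")
    print("Fits on a 32 GB card:", fits_with_ttt(20.0, 32.0))
```

Measure the non-TTT baseline at your real batch size and context length first, then apply the range to that number.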

Here are the GPU options at live Spheron pricing for TTT-Linear serving:

GPU	VRAM	TTT-Linear 1.3B Fits?	TTT-Linear 3B Fits?	TTT-MLP 1.3B Fits?	On-Demand Price	Spot Price
RTX 5090	32 GB GDDR7	Yes (2 inner steps max)	No	Yes (1 inner step)	$0.68/hr	N/A
L40S	48 GB GDDR6	Yes (8 inner steps)	Yes (4 inner steps)	Yes (4 inner steps)	$0.72/hr	$1.64/hr
H100 SXM5	80 GB HBM3	Yes (16+ inner steps)	Yes (8+ inner steps)	Yes (8+ inner steps)	$3.84/hr	N/A
A100 PCIe	80 GB HBM2e	Yes (16+ inner steps)	Yes (8+ inner steps)	Yes (8+ inner steps)	$1.04/hr	$1.19/hr

The L40S is the sweet spot for TTT-Linear 1.3B and 3B at production inner-loop depths. For teams optimizing on price at smaller model sizes, the RTX 5090 on Spheron covers TTT-Linear 1.3B at up to 2 inner-loop steps.

Pricing fluctuates based on GPU availability. The prices above are as of 29 May 2026 and may have changed. Check the current GPU pricing page for live rates.

Reference Implementations

Two implementations are available for TTT inference today. A third path, through vLLM, is not yet production-ready:

ttt-lm-pytorch is the PyTorch reference with a Triton-fused kernel for the inner-loop backward pass. This is the practical starting point for GPU cloud deployment. The Triton kernel is what makes the backward pass fast enough to not completely dominate total inference latency. Without it, the naive PyTorch implementation is 3-5x slower on the inner loop.

JAX/Flax implementation is faster on TPU and was the basis for the 7B scaling experiments. For GPU cloud, it offers no advantage over the PyTorch path and requires more setup (jax[cuda12], flax). Teams already running JAX on GPU may find it easier to adapt.

vLLM TTT integration is not yet available as a stable, purpose-built plugin for TTT-Linear or TTT-MLP models. Community projects like tLLM add test-time training hooks to vLLM's v1 runtime, but they are not specific to the TTT architecture from Sun et al. If a dedicated vLLM TTT backend emerges, verify it handles per-request state cleanup correctly before using it in production. For now, the ttt-lm-pytorch FastAPI wrapper is the recommended serving path.

Installing the PyTorch implementation:

bash

# Clone and install ttt-lm-pytorch
git clone https://github.com/test-time-training/ttt-lm-pytorch
cd ttt-lm-pytorch
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "torch>=2.3" transformers accelerate triton

Loading a checkpoint and running inference:

python

from ttt import TTTForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "Test-Time-Training/ttt-linear-1.3b-pile-8k"

# Load the pretrained checkpoint and its tokenizer, then move the model to the GPU.
model = TTTForCausalLM.from_pretrained(MODEL_ID).cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# ttt_inner_steps sets how many inner-loop SGD steps run per input chunk.
inputs = tokenizer("The attention mechanism in transformers", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, use_ttt=True, ttt_inner_steps=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Deploying TTT-Linear 1.3B on Spheron L40S

The L40S is the right GPU for TTT-Linear 1.3B and 3B at 4-8 inner-loop steps. At $0.72/hr on-demand, it's the most cost-effective option that supports the full inner-loop depth range without VRAM pressure.

Step-by-step:

1. Provision the instance. Log into app.spheron.ai, select L40S (48 GB), choose Ubuntu 22.04 with NVIDIA Docker runtime. Verify with nvidia-smi that CUDA 12.x is available and 48 GB GDDR6 is present.

2. Install dependencies.

bash

git clone https://github.com/test-time-training/ttt-lm-pytorch
cd ttt-lm-pytorch
pip install "torch>=2.3" transformers accelerate triton

3. Download the checkpoint from the Test-Time-Training organization on Hugging Face:

bash

huggingface-cli download Test-Time-Training/ttt-linear-1.3b-pile-8k --local-dir ./ttt-linear-1.3b

4. Launch the serving endpoint. The ttt-lm-pytorch repo includes a FastAPI-based serving script. For production OpenAI-compatible routing, wrap it:

python

from ttt import TTTForCausalLM
from transformers import AutoTokenizer
import uvicorn
import asyncio
import uuid
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
inference_lock = asyncio.Lock()
model = TTTForCausalLM.from_pretrained("./ttt-linear-1.3b").cuda()
tokenizer = AutoTokenizer.from_pretrained("./ttt-linear-1.3b")

class CompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    ttt_inner_steps: int = 4
    model: str = "ttt-linear-1.3b"

@app.post("/v1/completions")
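async def complete(req: CompletionRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to("cuda")
    prompt_len = inputs["input_ids"].shape[-1]
    loop = asyncio.get_running_loop()
    # Serialize GPU work so one request's inner-loop updates finish before the next starts.
    async with inference_lock:
        outputs = await loop.run_in_executor(
            None,
            lambda: model.generate(
                **inputs,
                max_new_tokens=req.max_new_tokens,
                use_ttt=True,
                ttt_inner_steps=req.ttt_inner_steps,
            )
        )
    # Drop the echoed prompt tokens and keep only the newly generated ones.
    completion_ids = outputs[0][prompt_len:]
    completion_len = int(completion_ids.shape[-1])
    generated_text = tokenizer.decode(completion_ids, skip_special_tokens=True)
    return {
        "id": f"cmpl-{uuid.uuid4().hex[:8]}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "text": generated_text,
            "index": 0,
            "finish_reason": "length" if completion_len >= req.max_new_tokens else "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_len,
            "completion_tokens": completion_len,
            "total_tokens": prompt_len + completion_len,
        },
    }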

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

5. Test with a long-context input. TTT shows the most benefit at 8K+ tokens where the hidden-state adaptation accumulates meaningful signal. A short 512-token input will show little improvement over standard inference.
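A long-context smoke test against the wrapper from step 4 might look like the sketch below. It uses only the standard library and the request fields that wrapper defines (`prompt`, `max_new_tokens`, `ttt_inner_steps`). The file path and helper names are hypothetical, and the endpoint URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def build_payload(document, instruction, ttt_inner_steps=4, max_new_tokens=256):
    """Build the JSON body expected by the /v1/completions wrapper."""
    return {
        "prompt": f"{document}\n\n{instruction}",
        "max_new_tokens": max_new_tokens,
        "ttt_inner_steps": ttt_inner_steps,
    }


def post_completion(payload, url="http://localhost:8000/v1/completions",
                    timeout_s=600.0):
    """POST the payload and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical path: point this at a document of 8K+ tokens.
    with open("long_document.txt", encoding="utf-8") as handle:
        document = handle.read()
    result = post_completion(build_payload(document, "Summarize the above."))
    print(result["choices"][0]["text"])
```

Running the same document at a few different `ttt_inner_steps` values gives a first qualitative read on whether adaptation helps your inputs.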

Throughput and Latency Tradeoffs

Inner-loop step count is the main dial you control. More steps mean better hidden-state adaptation and higher quality, but also higher latency per token.

Inner-Loop Steps	Relative Latency vs Baseline	Perplexity on Long-Context Benchmark	Best For
0 (disabled)	1x	Baseline	Comparison only
1	1.2x	~5% improvement	Latency-sensitive
4	1.7x	~15% improvement	Balanced
8	2.5x	~20% improvement	Quality-first
16	4x	~22% improvement	Diminishing returns

Numbers above are illustrative based on ttt-lm-pytorch defaults. Run python benchmark_ttt.py from the repo to measure actual values on your hardware.

Gains plateau around 8 steps. For most production workloads, 4 steps is the sweet spot. The jump from 4 to 8 steps adds 50% more latency for only 5 more percentage points of perplexity improvement. Beyond 8, you're getting less than 1 point per additional step at roughly equal latency cost per step.
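The marginal-return reasoning can be checked with a few lines of arithmetic. The sketch below uses the illustrative figures from the table above, not measurements. Swap in your own benchmark results before drawing conclusions.

```python
# (inner steps, relative latency, perplexity improvement in percent)
# Illustrative values copied from the table in this section.
ILLUSTRATIVE_ROWS = [(1, 1.2, 5.0), (4, 1.7, 15.0), (8, 2.5, 20.0), (16, 4.0, 22.0)]


def marginal_returns(rows):
    """Compare each step count with the previous one, starting from TTT disabled."""
    results = []
    prev_steps, prev_latency, prev_gain = 0, 1.0, 0.0
    for steps, latency, gain in rows:
        results.append({
            "from_steps": prev_steps,
            "to_steps": steps,
            "gain_points_per_step": (gain - prev_gain) / (steps - prev_steps),
            "latency_increase_pct": (latency / prev_latency - 1.0) * 100.0,
        })
        prev_steps, prev_latency, prev_gain = steps, latency, gain
    return results


if __name__ == "__main__":
    for row in marginal_returns(ILLUSTRATIVE_ROWS):
        print(f"{row['from_steps']:>2} -> {row['to_steps']:>2} steps: "
              f"{row['gain_points_per_step']:.2f} points/step, "
              f"+{row['latency_increase_pct']:.0f}% latency")
```

With these inputs, the 4-to-8 transition yields 1.25 points per added step and the 8-to-16 transition yields 0.25, which is the plateau described above.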

TTT-Linear gets more benefit than TTT-MLP from the first few inner-loop steps because the linear model converges faster on compressed representations. TTT-MLP keeps improving at higher step counts but requires more VRAM for activations.

When TTT Wins

Three scenarios where TTT provides real value:

1. Long-context adaptation. Code repositories, long legal documents, research papers where domain vocabulary and writing style shift across the document. The TTT hidden state accumulates input-specific signal that improves token predictions later in the document. Standard inference has no such accumulation. At contexts below 2K tokens, there isn't enough signal for the inner-loop adaptation to matter, so TTT's overhead outweighs its benefit.

2. Distribution shift at inference time. A product serving users from different domains (medical, legal, coding, finance) where each session's input looks different from the next. TTT adapts without any per-user stored artifacts. Each request adapts to its own input and discards the state afterward. This is the scenario where TTT beats LoRA: LoRA requires knowing the domain in advance and training an adapter for it.

3. Personalization without fine-tuning infrastructure. Teams that cannot afford per-user LoRA training (data collection, training jobs, adapter storage, adapter loading logic) can use TTT as a zero-overhead-to-deploy personalization layer. The cost is paid per inference request, not upfront.

When TTT loses: short contexts where adaptation signal is too thin, latency-sensitive APIs where 1.7-2.5x overhead breaks SLAs, or when a well-trained LoRA adapter already exists for the domain. LoRA at inference has zero gradient overhead once the adapter is loaded. TTT always pays the gradient cost.

Integrating TTT into a Serving Stack

TTT hidden state has a strict per-request lifecycle:

Initialize: fresh hidden state at the start of each request (random or zero initialization)
Prefill: hidden state updates via inner-loop SGD as input tokens are processed
Generate: state is maintained and continues updating during token generation
Discard: state is dropped when the request completes
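The lifecycle above maps naturally onto a context manager. The sketch below is a hypothetical illustration and not part of ttt-lm-pytorch: `TTTStatePool` and `request_state` are invented names, and the state is a plain zero-initialized matrix standing in for a real GPU buffer.

```python
from contextlib import contextmanager


class TTTStatePool:
    """Tracks one hidden-state buffer per in-flight request."""

    def __init__(self, dim):
        self.dim = dim
        self.active = {}  # request_id -> state buffer

    @contextmanager
    def request_state(self, request_id):
        if request_id in self.active:
            raise KeyError(f"request {request_id!r} already has a state")
        # Initialize: fresh zero state at the start of the request.
        state = [[0.0] * self.dim for _ in range(self.dim)]
        self.active[request_id] = state
        try:
            # Prefill and generate happen inside the with-block.
            yield state
        finally:
            # Discard: always release, even if generation raised.
            self.active.pop(request_id, None)


if __name__ == "__main__":
    pool = TTTStatePool(dim=4)
    with pool.request_state("req-1") as state:
        state[0][0] = 1.0  # stand-in for inner-loop updates
        print("active during request:", len(pool.active))
    print("active after request:", len(pool.active))
```

Releasing the state in a `finally` clause means a failed or cancelled request cannot leak adapted state into a later one.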

No state persists between requests. This is different from Mamba's stateful inference, where the hidden state can in principle persist across turns. For TTT, each turn must re-process the full conversation history to rebuild its adapted state, unless you implement an explicit state cache keyed by conversation ID.

For production serving, the per-request state lifecycle means your serving framework needs to allocate and release hidden-state buffers per request. The ttt-lm-pytorch serving stack handles this. A custom vLLM integration would need to map TTT states into the paged attention memory pool.

python

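import openai

# The wrapper does not check credentials, so any placeholder key works.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Placeholder path: replace with the long document you want the model to adapt to.
with open("document.txt", encoding="utf-8") as f:
    long_context_document = f.read()

response = client.completions.create(
    model="ttt-linear-1.3b",
    prompt=long_context_document + "\n\nSummarize the above.",
    extra_body={"ttt_inner_steps": 4}  # per-request override
)
print(response.choices[0].text)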

This pattern is distinct from how inference-time compute scaling works in a serving stack. Compute scaling approaches (chain-of-thought, Best-of-N) have no per-request gradient state, so they use simpler batching strategies than TTT.

Cost Analysis on Spheron

Worked example: per-user TTT adaptation on L40S vs per-user LoRA fine-tuning on the same GPU.

Scenario: 10,000 users, each sends 50 queries/day with 4,000-token contexts.

TTT approach (L40S on Spheron at $0.72/hr):

Model: TTT-Linear 1.3B, 4 inner-loop steps
Throughput at 4K context, 4 inner-loop steps: approximately 320 tokens/sec on L40S
Time per query (4K input + 256 output tokens): approximately 14 seconds
L40S GPU time per query: 14s / 3600s = 0.0039 GPU-hours
Cost per query: 0.0039 x $0.72 = $0.0028
Cost per user per day (50 queries): $0.140
Cost for 10,000 users per day: $1,400/day

LoRA approach (L40S on Spheron, using vLLM multi-adapter serving):

Per-user LoRA training: 1-3 hours on L40S = $0.72-$2.16 one-time training cost per user
Inference overhead at 4K context: near zero (LoRA adapter merged at load time)
At batch size 16, throughput at 4K context: approximately 1,800 tokens/sec
Cost per query (4K input + 256 output tokens): approximately 0.0025 GPU-hours x $0.72 = $0.0018
Cost for 10,000 users per day (50 queries each): $900/day
Break-even on upfront training cost: $0.72-$2.16 per user / $0.0010 per-query savings = 720-2,160 queries per user

For reference on multi-adapter LoRA serving, a single H100 on Spheron can serve hundreds of LoRA adapters simultaneously using vLLM's multi-adapter support.

Decision crossover: TTT is more cost-effective than LoRA when the per-user query volume is low enough that per-user training cost never amortizes. For users who send fewer than 720 queries before churning, TTT wins on total cost. For users who stay active for months at 50 queries/day, LoRA eventually costs less per query.
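The worked example can be reproduced in a few lines, which makes it easy to substitute your own rates and volumes. The sketch below takes the per-query GPU-hours, hourly rate, and training costs from this section as inputs. The function names are hypothetical.

```python
L40S_RATE_USD_PER_HR = 0.72
USERS = 10_000
QUERIES_PER_USER_PER_DAY = 50


def cost_per_query(gpu_hours, rate=L40S_RATE_USD_PER_HR):
    """GPU cost of a single query in USD."""
    return gpu_hours * rate


def daily_fleet_cost(per_query_cost, users=USERS,
                     queries_per_day=QUERIES_PER_USER_PER_DAY):
    """Total inference cost per day across all users."""
    return per_query_cost * users * queries_per_day


def break_even_queries(training_cost, ttt_cost, lora_cost):
    """Queries a user must send before LoRA training cost is recovered."""
    savings = ttt_cost - lora_cost
    if savings <= 0:
        raise ValueError("LoRA must be cheaper per query to ever break even")
    return training_cost / savings


# Per-query costs, rounded to four decimals as in the text.
ttt_cost = round(cost_per_query(0.0039), 4)
lora_cost = round(cost_per_query(0.0025), 4)

if __name__ == "__main__":
    print(f"TTT:  ${ttt_cost:.4f}/query, ${daily_fleet_cost(ttt_cost):,.0f}/day")
    print(f"LoRA: ${lora_cost:.4f}/query, ${daily_fleet_cost(lora_cost):,.0f}/day")
    for training_cost in (0.72, 2.16):
        queries = break_even_queries(training_cost, ttt_cost, lora_cost)
        days = queries / QUERIES_PER_USER_PER_DAY
        print(f"Training ${training_cost:.2f}: break-even after "
              f"{queries:,.0f} queries ({days:.1f} days at 50/day)")
```

At 50 queries per day, the 720-2,160 query break-even range corresponds to roughly two to six weeks of activity per user.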


Check the pricing page for the latest L40S and H100 rates before running your own cost model.

TTT vs Continuous Pretraining vs LoRA: Decision Guide

Scenario	Recommended Approach	Why
Domain shifts per request, no training budget	TTT	No upfront cost, adapts at inference time
1-10 known domains, budget for one fine-tuning run	LoRA per domain	Zero inference overhead, good quality
100+ users, each with unique domain	TTT or LoRA per user	LoRA requires per-user training; TTT does not
Full domain shift for a new product	Continuous pretraining	TTT and LoRA are insufficient for large distribution shifts
Low-latency API (<200ms P50)	Standard inference or LoRA	TTT overhead (1.7-2.5x) breaks SLAs

The "continuous pretraining" row covers cases where the model needs to internalize a domain's vocabulary at the weight level. That's not a problem TTT or LoRA can solve. The "LoRA per domain" row covers multi-tenant adapter infrastructure, which the LoRA serving guide details.

The short version: TTT is the right choice when you need adaptation without any training infrastructure and can accept 1.7-2.5x inference latency overhead. For everything else, check whether LoRA or CPT better matches your constraint.

TTT fills a specific gap: adaptation that is too dynamic for LoRA (you don't know the domain in advance) but too incremental to justify a full CPT run. Long-context document analysis, multi-domain products serving diverse user bases, and personalization without per-user training pipelines are the clearest fits.

TTT inference runs inner-loop SGD at token generation time, which raises GPU cost-per-token but removes the need for per-user training infrastructure. Spheron's L40S and RTX 5090 instances are the most cost-effective starting point for TTT model serving.

Quick Setup Guide

1. Provision an L40S instance on Spheron. Log into app.spheron.ai, select the L40S (48 GB) from the GPU catalog, choose Ubuntu 22.04 with the NVIDIA Docker runtime template, and deploy. Verify with nvidia-smi that the 48 GB GDDR6 is available and that CUDA 12.x is installed.

2. Install TTT dependencies. Clone the reference repo: git clone https://github.com/test-time-training/ttt-lm-pytorch. Then install: pip install "torch>=2.3" transformers accelerate triton. For the JAX implementation, use pip install "jax[cuda12]" flax instead. The Triton kernel is required for the fused inner-loop SGD backward pass that makes TTT inference fast enough for production.

3. Load a TTT-Linear checkpoint. Download a pretrained TTT-Linear 1.3B checkpoint from Hugging Face under the Test-Time-Training organization. Run: huggingface-cli download Test-Time-Training/ttt-linear-1.3b-pile-8k --local-dir ./ttt-linear-1.3b. Load with: from ttt import TTTForCausalLM; model = TTTForCausalLM.from_pretrained('./ttt-linear-1.3b').cuda(). Pass ttt_inner_steps=4 to model.generate() to configure inner-loop SGD depth.

4. Run inference with the TTT model. Use the standard Hugging Face generate API: outputs = model.generate(input_ids, max_new_tokens=512, use_ttt=True). The TTT hidden state initializes per call and updates with each input token during the prefill phase. At the end of generation, the hidden state is discarded, so there is no cross-request state leakage.

5. Deploy via the ttt-lm-pytorch serving stack. For production serving, wrap the TTT model in a FastAPI endpoint that handles the per-request state lifecycle. The ttt-lm-pytorch reference implementation includes a basic serving script. Launch with: python serve.py --model-path <checkpoint-path> --ttt-inner-steps 4 --port 8000.

6. Benchmark inner-loop steps vs token latency. Run the benchmark script included in ttt-lm-pytorch: python benchmark_ttt.py --model <checkpoint-path> --inner-steps 1 4 8 --input-len 4096 8192 16384 --output-len 256. Record tokens/second and time-to-first-token at each combination. This produces the data for your latency-quality tradeoff decision.
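If you want to record the same two metrics from your own serving path, a small timing harness is enough. The sketch below is a generic illustration, not the repo's benchmark script. `measure_stream` and `fake_generate` are hypothetical names, and the fake generator only simulates prefill and decode delays so the harness can run without a GPU.

```python
import time


def measure_stream(generate):
    """Time a callable that returns an iterator of generated tokens."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate():
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token mark
        n_tokens += 1
    end = time.perf_counter()
    total_s = end - start
    return {
        "tokens": n_tokens,
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": total_s,
        "tokens_per_s": n_tokens / total_s if total_s > 0 else 0.0,
    }


def fake_generate(prefill_s=0.02, per_token_s=0.001, n_tokens=5):
    """Stand-in for a streaming model call (assumption, for demonstration)."""
    def stream():
        time.sleep(prefill_s)          # simulated prefill, where TTT updates run
        for i in range(n_tokens):
            time.sleep(per_token_s)    # simulated decode step
            yield f"tok{i}"
    return stream


if __name__ == "__main__":
    stats = measure_stream(fake_generate())
    print(stats)
```

Replace `fake_generate()` with a function that streams tokens from your model, and run it once per inner-step and input-length combination.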

Frequently Asked Questions

What is test-time training and how is it different from inference-time compute scaling?

Test-time training (TTT) updates a small subset of model parameters, the TTT layer's hidden state, using inner-loop SGD on the current input sequence during inference. The weights of the main model layers stay frozen. Inference-time compute scaling, by contrast, generates more tokens (chain-of-thought or best-of-N passes) without touching any weights. TTT creates a per-request adapted model; inference-time scaling creates a longer reasoning trace.

What GPU hardware does TTT inference require?

TTT inner-loop SGD runs on the GPU at inference time, which means you need compute headroom beyond what standard autoregressive inference uses. For a TTT-Linear 1.3B model with 4 inner-loop steps at batch size 1, an L40S (48 GB GDDR6) provides comfortable headroom. For larger TTT models (7B+) or higher inner-loop step counts, an H100 SXM5 (80 GB HBM3) or H200 (141 GB HBM3e) is recommended. The L40S and RTX 5090 are the most cost-effective on-demand options for TTT-Linear up to 3B parameters.

How much slower is TTT inference compared to standard LLM inference?

At 1 inner-loop step, TTT-Linear adds roughly 15-25% latency overhead per token compared to the equivalent non-TTT forward pass. At 4 inner-loop steps (the default in the ttt-lm-pytorch reference implementation), latency overhead is 60-90%. At 8 steps, overhead exceeds 2x. The tradeoff is context-dependent quality improvement: TTT is most useful at long contexts (8K+ tokens) where the hidden-state adaptation accumulates meaningful signal.

Can I deploy TTT models with vLLM?

As of mid-2026, vLLM does not natively support TTT-Linear models. Community projects like tLLM (github.com/LinesHogan/tLLM) provide test-time training extensions for vLLM, but they are not purpose-built for the TTT-Linear architecture. For production deployments, using ttt-lm-pytorch directly with a FastAPI wrapper is the most reliable path. Continuous batching with TTT would require per-request state management because each request maintains its own TTT hidden state.

When should I use TTT instead of LoRA per-user adapters?

Use TTT when you need inference-time adaptation without storing per-user adapter weights. TTT adapts to the current input context dynamically, which suits distribution-shift scenarios (each request has different domain vocabulary) and long-context tasks (code repositories, long documents). LoRA per-user adapters are better when users have consistent domain needs that can be captured in a trained adapter and the per-user fine-tuning cost is acceptable. TTT has higher per-token inference cost; LoRA has higher upfront fine-tuning cost.