
Test-Time Training on GPU Cloud: Deploy TTT Layers for Adaptive LLM Inference Without Retraining (2026 Guide)

Written by Mitrasish, Co-founder, May 29, 2026
Tags: Test-Time Training, TTT LLM, Test Time Training GPU, TTT Layers Deployment, Adaptive Inference LLM, vLLM, L40S, GPU Cloud

Test-time training adapts a model to the current input during inference, not during a separate training run. This is a different idea from inference-time compute scaling, which spends more GPU on token generation without touching weights. TTT runs gradient updates on a small set of parameters while your request is being processed. The result is a model that has seen your specific input and adjusted itself to handle it better, all before generating the first output token.

The connection to state space models is direct. The Mamba-3 deployment guide covers how SSMs replace the KV cache with a fixed-size recurrent state. TTT takes this idea further: instead of a fixed learned transition rule, the recurrent state is itself a small model that is gradient-updated on your input. That distinction determines which GPU you need and how much latency you should expect.

What Test-Time Training Is (and What It Is Not)

TTT runs a small number of SGD steps on the TTT layer's hidden state using the current input as self-supervised data. The main model weights never change. After the request completes, the hidden state is discarded. The next request starts fresh.

This is distinct from three other techniques that sound similar but work differently:

Inference-time compute scaling generates more tokens (chain-of-thought, best-of-N) to get better answers. No weights are updated. No gradient is computed. The model is unchanged.

Fine-tuning updates the main model weights on labeled data during a training run, before deployment. It produces a new model artifact that is used for all future requests equally.

LoRA per-user adapters train small adapter matrices on user-specific data, then load those matrices at inference time. The adapter is static once trained. TTT, by contrast, adapts dynamically to each input at inference time without any prior training on that input.

| Technique | Weights Updated | When Compute Runs | Per-Request Cost | Produces Adapted Artifact |
|---|---|---|---|---|
| Standard inference | None | Inference only | Baseline | No |
| Inference-time compute scaling | None | Inference only | 10-50x | No |
| Test-Time Training | TTT hidden state only | Inference (inner loop) | 1.6-3x | No (state is discarded) |
| LoRA per-user fine-tuning | LoRA adapter weights | Training | Training cost amortized | Yes (stored adapter) |
| Continuous pretraining | All weights | Training | Very high | Yes (new checkpoint) |

For teams weighing per-user adapter serving as an alternative, see the LoRA multi-adapter serving guide. When the distribution shift is large enough that TTT and LoRA are both insufficient, see the continuous pretraining guide.

The key practical implication: TTT has no upfront training cost per user, no stored adapter to manage, and no per-user infrastructure. But every inference request pays the gradient computation cost.

TTT Layers vs Mamba vs Attention: The Hidden-State-as-Model Intuition

Each of the three architectures represents context differently:

A transformer's KV cache grows with every token in the sequence. It stores the full history and attends to all of it on each generation step. Memory scales with sequence length.

Mamba and other SSMs compress the past into a fixed-size recurrent state. The state is updated by a learned transition rule on each new token. Memory stays constant regardless of context length. The transition rule is fixed at training time.

TTT takes the SSM idea one step further: the hidden state is itself a small model, either a linear model (TTT-Linear) or a small MLP (TTT-MLP). When new input tokens arrive, the model runs gradient descent on this small inner model using a self-supervised objective. The inner model adapts to the current input distribution rather than just updating according to a fixed rule.
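The inner-loop update described above can be sketched in a few lines of plain Python. This is a toy illustration, not the ttt-lm-pytorch kernel: it assumes a simple reconstruction objective (predict a token vector from itself), a zero-filled starting matrix as a placeholder, and hypothetical helper names (`matvec`, `ttt_linear_step`, `run_inner_loop`). The real layers use learned projections and fused kernels.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_linear_step(W, x_in, x_target, lr):
    """One inner-loop SGD step on loss = ||W @ x_in - x_target||^2."""
    pred = matvec(W, x_in)
    err = [p - t for p, t in zip(pred, x_target)]
    loss = sum(e * e for e in err)
    # d(loss)/dW[i][j] = 2 * err[i] * x_in[j]
    new_W = [
        [w - lr * 2.0 * e * xj for w, xj in zip(row, x_in)]
        for row, e in zip(W, err)
    ]
    return new_W, loss


def run_inner_loop(tokens, dim, lr=0.1, inner_steps=1):
    """Process tokens in order, updating the fixed-size state W on each one.

    Returns the final state and, for each token, the loss measured
    before the state adapted to that token.
    """
    W = [[0.0] * dim for _ in range(dim)]  # placeholder initial state
    losses = []
    for x in tokens:
        for step in range(inner_steps):
            W, loss = ttt_linear_step(W, x, x, lr)
            if step == 0:
                losses.append(loss)
    return W, losses


# Two alternating "token" vectors, repeated five times.
tokens = [[1.0, 0.0], [0.0, 1.0]] * 5
W, losses = run_inner_loop(tokens, dim=2)
print("first loss:", losses[0], "last loss:", losses[-1])
```

The state `W` stays at `dim x dim` no matter how many tokens are processed, while the loss on a repeated pattern shrinks as the inner model adapts to it. Raising `inner_steps` makes that adaptation faster at the price of more compute per token.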

| Architecture | Hidden State Type | Update Rule | Context Scaling | Supports OOD Adaptation |
|---|---|---|---|---|
| Transformer | None (KV cache grows) | Attention over full history | O(n) memory, O(n²) attention compute | No |
| Mamba/SSM | Fixed matrix | Learned SSM transition | O(1) memory | Limited |
| TTT-Linear | Small linear model | Inner-loop SGD | O(1) memory | Yes |
| TTT-MLP | Small MLP | Inner-loop SGD | O(1) memory | Yes (stronger) |

The Mamba row is for teams already running SSM inference and evaluating whether the TTT variant makes sense for their workload.

The practical difference: Mamba works well for sequences where the past distribution is stable. TTT works better when the current input comes from a domain the model hasn't seen much of, because the inner model learns a compressed representation of the current input's specific patterns.
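To make the memory contrast concrete, here is a back-of-the-envelope calculation. The model shape (24 layers, 32 heads, head dimension 64, fp16) is a hypothetical 1.3B-class configuration chosen for illustration, and the fixed-state formula assumes one `head_dim x head_dim` inner weight matrix per head per layer, which is a simplification of a TTT-Linear-style state.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fixed_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    """One head_dim x head_dim matrix per head per layer, independent of seq_len."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical shape: 24 layers, 32 heads, head_dim 64, fp16 (2 bytes).
for n in (2_048, 8_192, 32_768):
    kv = kv_cache_bytes(n, 24, 32, 64)
    state = fixed_state_bytes(24, 32, 64)
    print(f"{n:>6} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"fixed state {state / 2**20:6.1f} MiB")
```

The KV cache grows linearly with the number of tokens, while the fixed state is the same size at every context length. Activations for the inner-loop backward pass are not counted here.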

The TTT Paper Lineage

The core idea comes from Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (2024), which introduced TTT-Linear and TTT-MLP and reported language modeling perplexity competitive with transformer baselines while keeping context memory at O(1).

In 2025, the same group released a JAX implementation for TPU and scaled experiments to 7B parameter models. Perplexity at long contexts (32K+ tokens) showed consistent improvement over same-size transformers.

By 2026, two production-relevant components arrived: a Triton-fused inner-loop kernel (included in ttt-lm-pytorch) that makes the backward pass efficient enough for real inference use, and pretrained checkpoints on Hugging Face under the Test-Time-Training organization (e.g. ttt-linear-1.3b-pile-8k, ttt-mlp-1.3b-pile-8k).

Two reference implementations exist:

  • ttt-lm-pytorch: PyTorch with a Triton-fused kernel for the inner-loop SGD backward pass. This is the practical starting point for GPU cloud deployment. Easier to modify than the JAX version.
  • JAX/Flax implementation: Faster on TPU, relevant for teams already on JAX infrastructure. Less directly useful for the GPU cloud deployment case.

Note: pretrained TTT-Linear and TTT-MLP checkpoints are published under the Test-Time-Training organization on Hugging Face (e.g. Test-Time-Training/ttt-linear-1.3b-pile-8k). Model IDs may change between research releases, so verify the current list at huggingface.co/Test-Time-Training before deploying.

Hardware Implications: Inner-Loop SGD at Inference Time

Standard LLM inference on a GPU runs:

  1. Forward pass through each model layer
  2. KV cache update
  3. Token sampling

TTT inference adds three additional steps during the prefill phase for every input chunk:

  1. Forward pass through the TTT layer's small inner model
  2. Backward pass through the TTT layer's small inner model (gradient computation)
  3. SGD update step on the TTT hidden state

The backward pass is the novel cost. It requires storing activations from the TTT layer forward pass during prefill. For TTT-Linear, these activations are a matrix, not a deep graph. The overhead is manageable. For TTT-MLP, activations from a small MLP must be stored, which is larger.

The backward pass also requires compute beyond the forward-only inference budget. A GPU that is 90% utilized running standard inference may not have headroom for TTT at the same batch size. Size for TTT by adding 30-50% to the VRAM and compute budget you would allocate for the same model without TTT.
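The 30-50% sizing rule is simple arithmetic, sketched below. The function names and the 20 GB baseline footprint are hypothetical; the 90% utilization figure comes from the paragraph above.

```python
def ttt_budget(baseline, low=0.30, high=0.50):
    """Return the (low, high) budget after adding the 30-50% TTT margin."""
    return baseline * (1 + low), baseline * (1 + high)


def fits_with_ttt(baseline_vram_gb, gpu_vram_gb):
    """True if the conservative (high-end) VRAM estimate fits on the GPU."""
    _, worst_case = ttt_budget(baseline_vram_gb)
    return worst_case <= gpu_vram_gb


vram_low, vram_high = ttt_budget(20.0)  # hypothetical 20 GB non-TTT footprint
util_low, util_high = ttt_budget(0.90)  # the 90%-utilized GPU from the text
print(f"VRAM with TTT: {vram_low:.1f}-{vram_high:.1f} GB")
print(f"Utilization with TTT: {util_low:.0%}-{util_high:.0%}")
```

A projected utilization above 100% means the same batch size will not fit in the compute budget, so either the batch size or the inner-loop step count has to come down.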

Here are the GPU options at live Spheron pricing for TTT-Linear serving:

| GPU | VRAM | TTT-Linear 1.3B Fits? | TTT-Linear 3B Fits? | TTT-MLP 1.3B Fits? | On-Demand Price | Spot Price |
|---|---|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | Yes (2 inner steps max) | No | Yes (1 inner step) | $0.68/hr | N/A |
| L40S | 48 GB GDDR6 | Yes (8 inner steps) | Yes (4 inner steps) | Yes (4 inner steps) | $0.72/hr | $1.64/hr |
| H100 SXM5 | 80 GB HBM3 | Yes (16+ inner steps) | Yes (8+ inner steps) | Yes (8+ inner steps) | $3.84/hr | N/A |
| A100 PCIe | 80 GB HBM2e | Yes (16+ inner steps) | Yes (8+ inner steps) | Yes (8+ inner steps) | $1.04/hr | $1.19/hr |

The L40S is the sweet spot for TTT-Linear 1.3B and 3B at production inner-loop depths. For teams optimizing on price at smaller model sizes, the RTX 5090 on Spheron covers TTT-Linear 1.3B at up to 2 inner-loop steps.

Pricing fluctuates based on GPU availability. The prices above are based on 29 May 2026 and may have changed. Check current GPU pricing for live rates.
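The fit table above can be encoded as a small lookup to pick the cheapest on-demand GPU for a given model and inner-loop depth. This is only a sketch: the dictionary keys and function name are hypothetical, "16+" and "8+" are stored as 16 and 8, "No" is stored as 0, and the prices are the snapshot values from the table, not live rates.

```python
# name: (on-demand $/hr, max inner-loop steps per model, from the table above)
GPUS = {
    "RTX 5090": (0.68, {"ttt-linear-1.3b": 2, "ttt-linear-3b": 0, "ttt-mlp-1.3b": 1}),
    "L40S": (0.72, {"ttt-linear-1.3b": 8, "ttt-linear-3b": 4, "ttt-mlp-1.3b": 4}),
    "H100 SXM5": (3.84, {"ttt-linear-1.3b": 16, "ttt-linear-3b": 8, "ttt-mlp-1.3b": 8}),
    "A100 PCIe": (1.04, {"ttt-linear-1.3b": 16, "ttt-linear-3b": 8, "ttt-mlp-1.3b": 8}),
}


def cheapest_gpu(model, inner_steps):
    """Return the lowest-priced GPU that supports the requested step count, or None."""
    candidates = [
        (price, name)
        for name, (price, max_steps) in GPUS.items()
        if max_steps.get(model, 0) >= inner_steps
    ]
    return min(candidates)[1] if candidates else None


for model, steps in [("ttt-linear-1.3b", 2), ("ttt-linear-1.3b", 4), ("ttt-linear-3b", 8)]:
    print(model, steps, "->", cheapest_gpu(model, steps))
```

The lookup only considers whether a configuration fits and what it costs per hour. It ignores throughput differences between GPUs, which matter once latency targets enter the picture.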

Reference Implementations

Two implementations are available for TTT inference today, and a third path, vLLM integration, is not yet production-ready:

ttt-lm-pytorch is the PyTorch reference with a Triton-fused kernel for the inner-loop backward pass. This is the practical starting point for GPU cloud deployment. The Triton kernel is what makes the backward pass fast enough to not completely dominate total inference latency. Without it, the naive PyTorch implementation is 3-5x slower on the inner loop.

JAX/Flax implementation is faster on TPU and was the basis for the 7B scaling experiments. For GPU cloud, it offers no advantage over the PyTorch path and requires more setup (jax[cuda12], flax). Teams already running JAX on GPU may find it easier to adapt.

vLLM TTT integration is not yet available as a stable, purpose-built plugin for TTT-Linear or TTT-MLP models. Community projects like tLLM add test-time training hooks to vLLM's v1 runtime, but they are not specific to the TTT architecture from Sun et al. If a dedicated vLLM TTT backend emerges, verify it handles per-request state cleanup correctly before using it in production. For now, the ttt-lm-pytorch FastAPI wrapper is the recommended serving path.

Installing the PyTorch implementation:

bash
# Clone and install ttt-lm-pytorch
git clone https://github.com/test-time-training/ttt-lm-pytorch
cd ttt-lm-pytorch
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "torch>=2.3" transformers accelerate triton
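After installing, it can help to confirm that every package actually landed in the active environment. The snippet below is a generic check using only the standard library; the function name `installed_versions` is hypothetical and not part of ttt-lm-pytorch.

```python
from importlib import metadata

REQUIRED = ("torch", "transformers", "accelerate", "triton")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14}{version or 'MISSING'}")
```

Any line reporting `MISSING` means the corresponding `pip install` did not take effect in this interpreter, which is a common result of installing into a different virtual environment.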

Loading a checkpoint and running inference:

python
from ttt import TTTForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "Test-Time-Training/ttt-linear-1.3b-pile-8k"

# Load weights and tokenizer from the same checkpoint, then move the model to the GPU.
model = TTTForCausalLM.from_pretrained(MODEL_ID).cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("The attention mechanism in transformers", return_tensors="pt").to("cuda")
# ttt_inner_steps sets the number of inner-loop SGD steps per input chunk.
outputs = model.generate(**inputs, max_new_tokens=256, use_ttt=True, ttt_inner_steps=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Deploying TTT-Linear 1.3B on Spheron L40S

The L40S is the right GPU for TTT-Linear 1.3B and 3B at 4-8 inner-loop steps. At $0.72/hr on-demand, it's the most cost-effective option that supports the full inner-loop depth range without VRAM pressure.

Step-by-step:

1. Provision the instance. Log into app.spheron.ai, select L40S (48 GB), choose Ubuntu 22.04 with NVIDIA Docker runtime. Verify with nvidia-smi that CUDA 12.x is available and 48 GB GDDR6 is present.

2. Install dependencies.

bash
git clone https://github.com/test-time-training/ttt-lm-pytorch
cd ttt-lm-pytorch
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "torch>=2.3" transformers accelerate triton

3. Download the checkpoint from the Test-Time-Training HuggingFace organization:

bash
huggingface-cli download Test-Time-Training/ttt-linear-1.3b-pile-8k --local-dir ./ttt-linear-1.3b

4. Launch the serving endpoint. The ttt-lm-pytorch repo includes a FastAPI-based serving script. For production OpenAI-compatible routing, wrap it:

python
from ttt import TTTForCausalLM
from transformers import AutoTokenizer
import uvicorn
import asyncio
import uuid
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
inference_lock = asyncio.Lock()
model = TTTForCausalLM.from_pretrained("./ttt-linear-1.3b").cuda()
tokenizer = AutoTokenizer.from_pretrained("./ttt-linear-1.3b")

class CompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    ttt_inner_steps: int = 4
    model: str = "ttt-linear-1.3b"

@app.post("/v1/completions")
async def complete(req: CompletionRequest):
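    inputs = tokenizer(req.prompt, return_tensors="pt").to("cuda")
    prompt_len = int(inputs["input_ids"].shape[-1])
    loop = asyncio.get_running_loop()
    # Serialize GPU work: only one request at a time runs generate().
    async with inference_lock:
        # generate() blocks, so run it in a worker thread to keep the event loop responsive.
        outputs = await loop.run_in_executor(
            None,
            lambda: model.generate(
                **inputs,
                max_new_tokens=req.max_new_tokens,
                use_ttt=True,
                ttt_inner_steps=req.ttt_inner_steps,
            )
        )
    # Drop the prompt tokens so only the completion is returned.
    completion_ids = outputs[0][prompt_len:]
    completion_len = int(completion_ids.shape[-1])
    generated_text = tokenizer.decode(completion_ids, skip_special_tokens=True)
    finish_reason = "length" if completion_len >= req.max_new_tokens else "stop"
    return {
        "id": f"cmpl-{uuid.uuid4().hex[:8]}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{"text": generated_text, "index": 0, "finish_reason": finish_reason}],
        "usage": {
            "prompt_tokens": prompt_len,
            "completion_tokens": completion_len,
            "total_tokens": prompt_len + completion_len,
        },
    }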

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

5. Test with a long-context input. TTT shows the most benefit at 8K+ tokens where the hidden-state adaptation accumulates meaningful signal. A short 512-token input will show little improvement over standard inference.
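One way to run that test is with a small standard-library client. The sketch below assumes the endpoint and request fields of the FastAPI wrapper shown in step 4 (`prompt`, `max_new_tokens`, `ttt_inner_steps`, `model`); the helper names `build_payload` and `post_completion` are hypothetical.

```python
import json
import urllib.request


def build_payload(document, question, inner_steps=4, max_new_tokens=256):
    """Build a JSON-serializable body matching the wrapper's request fields."""
    return {
        "prompt": f"{document}\n\n{question}",
        "max_new_tokens": max_new_tokens,
        "ttt_inner_steps": inner_steps,
        "model": "ttt-linear-1.3b",
    }


def post_completion(payload, url="http://localhost:8000/v1/completions", timeout=300):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To compare settings, read a long document from disk, build one payload per `inner_steps` value you want to try, and pass each to `post_completion` while the server from step 4 is running. The generous timeout is deliberate, because long prompts with several inner-loop steps can take many seconds.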

Throughput and Latency Tradeoffs

Inner-loop step count is the main dial you control. More steps mean better hidden-state adaptation and higher quality, but also higher latency per token.

| Inner-Loop Steps | Relative Latency vs Baseline | Perplexity on Long-Context Benchmark | Best For |
|---|---|---|---|
| 0 (disabled) | 1x | Baseline | Comparison only |
| 1 | 1.2x | ~5% improvement | Latency-sensitive |
| 4 | 1.7x | ~15% improvement | Balanced |
| 8 | 2.5x | ~20% improvement | Quality-first |
| 16 | 4x | ~22% improvement | Diminishing returns |

Numbers above are illustrative based on ttt-lm-pytorch defaults. Run python benchmark_ttt.py from the repo to measure actual values on your hardware.

Gains plateau around 8 steps. For most production workloads, 4 steps is the sweet spot. The jump from 4 to 8 steps adds 50% more latency for only 5 more percentage points of perplexity improvement. Beyond 8, you're getting less than 1 point per additional step at roughly equal latency cost per step.
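The diminishing returns can be read directly off the table by computing how many perplexity points each extra unit of latency buys. The rows below are copied from the illustrative table above, so the output is illustrative too; the function names are hypothetical.

```python
# (inner steps, relative latency, perplexity improvement in %), from the table above
ROWS = [(0, 1.0, 0.0), (1, 1.2, 5.0), (4, 1.7, 15.0), (8, 2.5, 20.0), (16, 4.0, 22.0)]


def marginal_gains(rows=ROWS):
    """Perplexity points gained per 1x of added latency between consecutive rows."""
    gains = []
    for (s0, l0, q0), (s1, l1, q1) in zip(rows, rows[1:]):
        gains.append((s0, s1, round((q1 - q0) / (l1 - l0), 2)))
    return gains


def max_steps_within(latency_budget, rows=ROWS):
    """Largest tabulated step count whose relative latency fits the budget."""
    allowed = [steps for steps, latency, _ in rows if latency <= latency_budget]
    return max(allowed) if allowed else None


for start, end, gain in marginal_gains():
    print(f"{start:>2} -> {end:<2} steps: {gain} points per 1x latency")
print("max steps within a 2x latency budget:", max_steps_within(2.0))
```

Each successive interval buys fewer points per unit of latency, which is the plateau described above. Replace `ROWS` with your own benchmark results before using this for a real decision.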

TTT-Linear gets more benefit than TTT-MLP from the first few inner-loop steps because the linear model converges faster on compressed representations. TTT-MLP keeps improving at higher step counts but requires more VRAM for activations.

When TTT Wins

Three scenarios where TTT provides real value:

1. Long-context adaptation. Code repositories, long legal documents, research papers where domain vocabulary and writing style shift across the document. The TTT hidden state accumulates input-specific signal that improves token predictions later in the document. Standard inference has no such accumulation. At contexts below 2K tokens, there isn't enough signal for the inner-loop adaptation to matter, so TTT's overhead outweighs its benefit.

2. Distribution shift at inference time. A product serving users from different domains (medical, legal, coding, finance) where each session's input looks different from the next. TTT adapts without any per-user stored artifacts. Each request adapts to its own input and discards the state afterward. This is the scenario where TTT beats LoRA: LoRA requires knowing the domain in advance and training an adapter for it.

3. Personalization without fine-tuning infrastructure. Teams that cannot afford per-user LoRA training (data collection, training jobs, adapter storage, adapter loading logic) can use TTT as a zero-overhead-to-deploy personalization layer. The cost is paid per inference request, not upfront.

When TTT loses: short contexts where adaptation signal is too thin, latency-sensitive APIs where 1.7-2.5x overhead breaks SLAs, or when a well-trained LoRA adapter already exists for the domain. LoRA at inference has zero gradient overhead once the adapter is loaded. TTT always pays the gradient cost.

Integrating TTT into a Serving Stack

TTT hidden state has a strict per-request lifecycle:

  1. Initialize: the hidden state is reset to its initial value at the start of each request (in the Sun et al. formulation this initial value is learned during pretraining, not random or zero)
  2. Prefill: hidden state updates via inner-loop SGD as input tokens are processed
  3. Generate: state is maintained and continues updating during token generation
  4. Discard: state is dropped when the request completes
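The four-stage lifecycle maps naturally onto a context manager, which guarantees the discard step runs even when a request fails. The sketch below is a stand-in, not the ttt-lm-pytorch serving code: `TTTState` is a hypothetical class holding plain floats with placeholder initial values, and `live_states` is a hypothetical registry used only to show that nothing leaks.

```python
from contextlib import contextmanager


class TTTState:
    """Stand-in for a TTT layer's hidden state (hypothetical, holds plain floats)."""

    def __init__(self, size):
        self.weights = [0.0] * size  # placeholder initial values
        self.updates = 0

    def update(self, grad, lr=0.1):
        """Apply one SGD step to the state."""
        self.weights = [w - lr * g for w, g in zip(self.weights, grad)]
        self.updates += 1


@contextmanager
def request_state(size, live_states):
    """Create a fresh state, track it while the request runs, always discard it."""
    state = TTTState(size)                  # 1. Initialize
    live_states.add(id(state))
    try:
        yield state                         # 2-3. Prefill and generate update it
    finally:
        live_states.discard(id(state))      # 4. Discard, even if the request fails


live = set()
with request_state(4, live) as state:
    state.update([1.0, 1.0, 1.0, 1.0])
    print("live during request:", len(live))
print("live after request:", len(live))
```

Because the cleanup sits in a `finally` block, an exception during prefill or generation still releases the state, which is the property to check in any serving integration.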

No state persists between requests. This is different from Mamba's stateful inference, where the hidden state can in principle persist across turns. For TTT, each turn must re-process the full conversation history to rebuild its adapted state, unless you implement an explicit state cache keyed by conversation ID.
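An explicit state cache keyed by conversation ID could look like the following LRU sketch. It is illustrative only: `ConversationStateCache` is a hypothetical class, the cached `state` is whatever object your serving code uses for the adapted hidden state, and `tokens_seen` records how much of the conversation that state already covers.

```python
from collections import OrderedDict


class ConversationStateCache:
    """LRU cache of adapted states keyed by conversation ID (illustrative only)."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, conversation_id):
        """Return (state, tokens_seen) or None; a hit becomes most recently used."""
        entry = self._entries.get(conversation_id)
        if entry is not None:
            self._entries.move_to_end(conversation_id)
        return entry

    def put(self, conversation_id, state, tokens_seen):
        """Store the state and evict the least recently used entries if over capacity."""
        self._entries[conversation_id] = (state, tokens_seen)
        self._entries.move_to_end(conversation_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


cache = ConversationStateCache(max_entries=2)
cache.put("conv-a", "state-a", tokens_seen=1200)
cache.put("conv-b", "state-b", tokens_seen=800)
cache.get("conv-a")                       # touch conv-a so conv-b is the oldest
cache.put("conv-c", "state-c", tokens_seen=300)
print(cache.get("conv-b"))                # evicted
```

On a hit, only the tokens after `tokens_seen` need to be prefilled, provided the caller verifies that the earlier part of the conversation is unchanged. Cached states hold GPU or host memory, so the entry limit should be sized against the per-state footprint.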

For production serving, the per-request state lifecycle means your serving framework needs to allocate and release hidden-state buffers per request. The ttt-lm-pytorch serving stack handles this. A custom vLLM integration would need to map TTT states into the paged attention memory pool.

python
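import openai

# Point the OpenAI client at the local TTT server; the key is unused but the client requires one.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Placeholder path: load whatever long document you want the model to adapt on.
with open("document.txt", encoding="utf-8") as f:
    long_context_document = f.read()

response = client.completions.create(
    model="ttt-linear-1.3b",
    prompt=long_context_document + "\n\nSummarize the above.",
    extra_body={"ttt_inner_steps": 4}  # per-request override
)
print(response.choices[0].text)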

This pattern is distinct from how inference-time compute scaling works in a serving stack. Compute scaling approaches (chain-of-thought, Best-of-N) have no per-request gradient state, so they use simpler batching strategies than TTT.

Cost Analysis on Spheron

Worked example: per-user TTT adaptation on L40S vs per-user LoRA fine-tuning on the same GPU.

Scenario: 10,000 users, each sends 50 queries/day with 4,000-token contexts.

TTT approach (L40S on Spheron at $0.72/hr):

  • Model: TTT-Linear 1.3B, 4 inner-loop steps
  • Throughput at 4K context, 4 inner-loop steps: approximately 320 tokens/sec on L40S
  • Time per query (4K input + 256 output tokens): approximately 14 seconds
  • L40S GPU utilization per query: 14s / 3600s = 0.0039 GPU-hours
  • Cost per query: 0.0039 x $0.72 = $0.0028
  • Cost per user per day (50 queries): $0.140
  • Cost for 10,000 users per day: $1,400/day

LoRA approach (L40S on Spheron, using vLLM multi-adapter serving):

  • Per-user LoRA training: 1-3 hours on L40S = $0.72-$2.16 one-time training cost per user
  • Inference overhead at 4K context: near zero (LoRA adapter merged at load time)
  • At batch size 16, throughput at 4K context: approximately 1,800 tokens/sec
  • Cost per query (4K input + 256 output tokens): approximately 0.0025 GPU-hours x $0.72 = $0.0018
  • Cost for 10,000 users per day (50 queries each): $900/day
  • Break-even on upfront training cost: $0.72-$2.16 per user / $0.0010 per-query savings = 720-2,160 queries per user

For reference on multi-adapter LoRA serving, a single H100 on Spheron can serve hundreds of LoRA adapters simultaneously using vLLM's multi-adapter support.

Decision crossover: TTT is more cost-effective than LoRA when the per-user query volume is low enough that per-user training cost never amortizes. For users who send fewer than 720 queries before churning, TTT wins on total cost. For users who stay active for months at 50 queries/day, LoRA eventually costs less per query.
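The worked example can be reproduced with a few lines of arithmetic. The sketch below takes the per-query GPU-hours stated in the bullets (0.0039 for TTT, 0.0025 for LoRA) as given inputs instead of deriving them from throughput, and rounds per-query cost to four decimals as the text does; the function names are hypothetical.

```python
PRICE_PER_HOUR = 0.72  # L40S on-demand snapshot price from this guide
USERS, QUERIES_PER_DAY = 10_000, 50


def cost_per_query(gpu_hours, price=PRICE_PER_HOUR):
    """Per-query cost, rounded to 4 decimals as in the worked example."""
    return round(gpu_hours * price, 4)


def daily_cost(per_query, users=USERS, queries=QUERIES_PER_DAY):
    """Fleet-wide cost per day."""
    return per_query * users * queries


def break_even_queries(training_cost, ttt_per_query, lora_per_query):
    """Queries a user must send before LoRA's one-time training cost is repaid."""
    return training_cost / (ttt_per_query - lora_per_query)


ttt = cost_per_query(0.0039)   # about 14 s of L40S time per query
lora = cost_per_query(0.0025)  # figure stated in the LoRA bullets
print(f"TTT:  ${ttt:.4f}/query, ${daily_cost(ttt):,.0f}/day")
print(f"LoRA: ${lora:.4f}/query, ${daily_cost(lora):,.0f}/day")
for hours in (1, 3):
    queries = break_even_queries(hours * PRICE_PER_HOUR, ttt, lora)
    print(f"{hours} h training: break-even after {queries:,.0f} queries "
          f"({queries / QUERIES_PER_DAY:.1f} days at {QUERIES_PER_DAY}/day)")
```

Swapping in your own measured GPU-hours per query and current hourly price is the quickest way to see where the crossover lands for your workload.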


Check the pricing page for the latest L40S and H100 rates before running your own cost model.

TTT vs Continuous Pretraining vs LoRA: Decision Guide

| Scenario | Recommended Approach | Why |
|---|---|---|
| Domain shifts per request, no training budget | TTT | No upfront cost, adapts at inference time |
| 1-10 known domains, budget for one fine-tuning run | LoRA per domain | Zero inference overhead, good quality |
| 100+ users, each with unique domain | TTT or LoRA per user | LoRA requires per-user training; TTT does not |
| Full domain shift for a new product | Continuous pretraining | TTT and LoRA are insufficient for large distribution shifts |
| Low-latency API (<200ms P50) | Standard inference or LoRA | TTT overhead (1.7-2.5x) breaks SLAs |

The "continuous pretraining" row covers cases where the model needs to internalize a domain's vocabulary at the weight level. That's not a problem TTT or LoRA can solve. The "LoRA per domain" row covers multi-tenant adapter infrastructure, which the LoRA serving guide details.

The short version: TTT is the right choice when you need adaptation without any training infrastructure and can accept 1.7-2.5x inference latency. For everything else, check whether LoRA or continuous pretraining (CPT) better matches your constraint.

TTT fills a specific gap: adaptation that is too dynamic for LoRA (you don't know the domain in advance) but too incremental to justify a full CPT run. Long-context document analysis, multi-domain products serving diverse user bases, and personalization without per-user training pipelines are the clearest fits.

TTT inference runs inner-loop SGD at token generation time, which raises GPU cost-per-token but removes the need for per-user training infrastructure. Spheron's L40S and RTX 5090 instances are the most cost-effective starting point for TTT model serving.



Quick Setup Guide

  1. Provision an L40S instance on Spheron

    Log into app.spheron.ai, select the L40S (48 GB) from the GPU catalog, choose Ubuntu 22.04 with the NVIDIA Docker runtime template, and deploy. Verify with nvidia-smi that the 48 GB GDDR6 is available and that CUDA 12.x is installed.

  2. Install TTT dependencies

    Clone the reference repo: git clone https://github.com/test-time-training/ttt-lm-pytorch. Then install: pip install "torch>=2.3" transformers accelerate triton (the quotes stop the shell from treating ">" as a redirect). For the JAX implementation, use pip install "jax[cuda12]" flax instead. The Triton kernel is required for the fused inner-loop SGD backward pass that makes TTT inference fast enough for production.

  3. Load a TTT-Linear checkpoint

    Download a pretrained TTT-Linear 1.3B checkpoint from Hugging Face under the Test-Time-Training organization. Run: huggingface-cli download Test-Time-Training/ttt-linear-1.3b-pile-8k --local-dir ./ttt-linear-1.3b. Load with: from ttt import TTTForCausalLM; model = TTTForCausalLM.from_pretrained('./ttt-linear-1.3b').cuda(). Pass ttt_inner_steps=4 to model.generate() to configure inner-loop SGD depth.

  4. Run inference with the TTT model

    Use the standard Hugging Face generate API: outputs = model.generate(input_ids, max_new_tokens=512, use_ttt=True). The TTT hidden state initializes per call and updates with each input token during the prefill phase. At the end of generation, the hidden state is discarded, so there is no cross-request state leakage.

  5. Deploy via the ttt-lm-pytorch serving stack

    For production serving, wrap the TTT model in a FastAPI endpoint that handles per-request state lifecycle. The ttt-lm-pytorch reference implementation includes a basic serving script. Launch with: python serve.py --model-path <checkpoint-path> --ttt-inner-steps 4 --port 8000.

  6. Benchmark inner-loop steps vs token latency

    Run the benchmark script included in ttt-lm-pytorch: python benchmark_ttt.py --model <checkpoint-path> --inner-steps 1 4 8 --input-len 4096 8192 16384 --output-len 256. Record tokens/second and time-to-first-token at each combination. This produces the data for your latency-quality tradeoff decision.
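If you want to record the same two metrics from your own serving code, a minimal timing harness might look like this. It is a generic sketch, not the repo's benchmark script: `benchmark` accepts any callable that yields tokens one at a time, and `fake_stream` is a placeholder generator standing in for a real streaming model call.

```python
import time


def benchmark(stream_tokens, prompt, runs=3):
    """Measure time-to-first-token and decode throughput for a token generator."""
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        first = None
        count = 0
        for _token in stream_tokens(prompt):
            count += 1
            if first is None:
                first = time.perf_counter()
        end = time.perf_counter()
        if first is None:
            continue  # the generator produced nothing
        decode_time = end - first
        results.append({
            "ttft_s": first - start,
            "tokens": count,
            # Throughput over the tokens that followed the first one.
            "tokens_per_s": (count - 1) / decode_time if decode_time > 0 and count > 1 else 0.0,
        })
    return results


def fake_stream(prompt, n_tokens=5, delay=0.001):
    """Placeholder generator standing in for a real streaming model call."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"


for run in benchmark(fake_stream, "hello"):
    print(run)
```

Time-to-first-token captures the prefill cost, which is where the inner-loop updates are applied to the input, so it is the number most sensitive to the inner-step setting. Run several repetitions and discard the first, since warm-up effects tend to inflate it.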


Frequently Asked Questions

What is test-time training, and how does it differ from inference-time compute scaling?

Test-time training (TTT) updates a small subset of model parameters (the TTT layer's hidden state) using inner-loop SGD on the current input sequence during inference. The weights of the main model layers stay frozen. Inference-time compute scaling, by contrast, generates more tokens (chain-of-thought or best-of-N passes) without touching any weights. TTT creates a per-request adapted model; inference-time scaling creates a longer reasoning trace.

Which GPU do I need to run TTT inference?

TTT inner-loop SGD runs on the GPU at inference time, which means you need compute headroom beyond what standard autoregressive inference uses. For a TTT-Linear 1.3B model with 4 inner-loop steps at batch size 1, an L40S (48 GB GDDR6) provides comfortable headroom. For larger TTT models (7B+) or higher inner-loop step counts, an H100 SXM5 (80 GB HBM3) or H200 (141 GB HBM3e) is recommended. The L40S and RTX 5090 are the most cost-effective on-demand options for TTT-Linear up to 3B parameters.

How much latency does TTT add?

At 1 inner-loop step, TTT-Linear adds roughly 15-25% latency per token compared to the equivalent non-TTT forward pass. At 4 inner-loop steps (the default in the ttt-lm-pytorch reference implementation), the added latency is 60-90%. At 8 steps, total latency exceeds 2x the non-TTT baseline. The tradeoff is context-dependent quality improvement: TTT is most useful at long contexts (8K+ tokens) where the hidden-state adaptation accumulates meaningful signal.

Does vLLM support TTT models?

As of mid-2026, vLLM does not natively support TTT-Linear models. Community projects like tLLM (github.com/LinesHogan/tLLM) provide test-time training extensions for vLLM, but they are not purpose-built for the TTT-Linear architecture. For production deployments, using ttt-lm-pytorch directly with a FastAPI wrapper is the most reliable path. Continuous batching with TTT would require per-request state management because each request maintains its own TTT hidden state.

When should I use TTT instead of per-user LoRA adapters?

Use TTT when you need inference-time adaptation without storing per-user adapter weights. TTT adapts to the current input context dynamically, which suits distribution-shift scenarios (each request has different domain vocabulary) and long-context tasks (code repositories, long documents). LoRA per-user adapters are better when users have consistent domain needs that can be captured in a trained adapter and the per-user fine-tuning cost is acceptable. TTT has higher per-token inference cost; LoRA has higher upfront fine-tuning cost.
