Classical forecasting at scale is expensive and brittle. Maintaining separate ARIMA or Prophet models for thousands of SKUs requires constant retraining pipelines and per-series tuning. Time series foundation models change this: Chronos, Moirai, TimesFM, and Lag-Llama deliver zero-shot forecasts on series they have never seen, with no task-specific training. As of 2026, this approach is production-viable. Model accuracy on standard benchmarks (GIFT-Eval, LOTSA) now matches or exceeds domain-specific models, tooling has matured, and GPU cloud pricing makes batch forecasting jobs genuinely economical. This guide covers GPU setup end-to-end: hardware requirements, serving configs, throughput benchmarks, and a cost comparison against AWS Forecast. For batch serving patterns that apply equally here, see batch LLM inference on GPU cloud.
What Are Time Series Foundation Models?
Classical approaches each have a tax. ARIMA requires per-series stationarity analysis and manual lag selection. Prophet handles seasonality well but needs holiday and regressor configuration for every series. XGBoost and LightGBM can generalize across series but require significant feature engineering: lag features, rolling statistics, calendar encodings, and series metadata.
Time series foundation models take a different path. They are pretrained on large, diverse collections of real-world and synthetic time series, from household electricity consumption to hospital admissions to financial markets. During pretraining, the model learns general patterns: seasonality at multiple frequencies, trend changes, noise levels, and distributional shapes. At inference time, you pass in a raw context window and request a forecast. No training required. No feature engineering. No stationarity tests.
Why 2026 is the inflection point:
- Benchmark quality: GIFT-Eval and LOTSA benchmarks show zero-shot foundation models matching or beating per-series statistical models across most standard categories.
- Tooling maturity: Hugging Face integration, BentoML support, and stable Python APIs make deployment straightforward.
- GPU cloud economics: L40S on-demand instances at $0.72/hr make overnight batch forecasting jobs cheaper per million predictions than any managed forecasting API.
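The economics claim is easy to sanity-check with back-of-envelope arithmetic. A small helper using the L40S figures from the benchmark table later in this guide ($0.72/hr on-demand, ~420 forecasts/sec for Chronos-T5-Large):

```python
def cost_per_million_forecasts(price_per_hour: float, forecasts_per_sec: float) -> float:
    """Dollar cost to produce one million forecasts at a given sustained throughput."""
    hours_needed = 1_000_000 / forecasts_per_sec / 3600
    return price_per_hour * hours_needed

# L40S on-demand: $0.72/hr at ~420 forecasts/sec
print(round(cost_per_million_forecasts(0.72, 420), 2))  # ~0.48
```

At roughly $0.48 per million forecasts, an overnight batch job costs orders of magnitude less than per-series managed-API pricing.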
Model Comparison: Chronos, Moirai, TimesFM, Lag-Llama
| Model | Org | Architecture | Parameters | Context Length | Output | License |
|---|---|---|---|---|---|---|
| Chronos-T5 | Amazon | T5 encoder-decoder | 46M / 200M / 710M | Up to 512 time steps | Quantile distribution | Apache 2.0 |
| Chronos-Bolt | Amazon | Distilled T5 | 9M / 21M / 48M / 205M | Up to 512 time steps | Quantile distribution | Apache 2.0 |
| Moirai | Salesforce | Encoder-decoder (Universal TS Transformer) | 14M / 91M / 311M | Up to 5000 time steps | Distributional (Normal, NegBinomial, StudentT) | CC BY-NC 4.0 |
| TimesFM | Google DeepMind | Decoder-only Transformer | 200M / 500M | Up to 512 time points (1.0) / 2048 time points (2.0) | Point + quantile | Apache 2.0 |
| Lag-Llama | Various | Llama decoder | 2.5M | Configurable (lag features) | Probabilistic (Student-T) | Apache 2.0 |
Chronos-T5 uses a token vocabulary approach: it discretizes time series values into bins and treats forecasting as a sequence-to-sequence language task. This gives it broad zero-shot accuracy across diverse domains and clean probabilistic output via Monte Carlo sampling. Chronos-Bolt is a distilled version that is up to 250x faster and 20x more memory-efficient than Chronos-T5, making it the better default for latency-sensitive or resource-constrained deployments.
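The scale-then-bin idea can be illustrated in a few lines of NumPy. This is a toy sketch of the concept only, not Chronos's actual tokenizer (which uses its own scaling scheme and a learned vocabulary):

```python
import numpy as np

def tokenize(series: np.ndarray, n_bins: int = 4096, clip: float = 15.0) -> np.ndarray:
    """Scale a series by its mean absolute value, then bucket values into uniform bins."""
    scale = np.abs(series).mean() or 1.0   # avoid division by zero on all-zero series
    scaled = np.clip(series / scale, -clip, clip)
    # Map [-clip, clip] onto integer token IDs in [0, n_bins - 1]
    edges = np.linspace(-clip, clip, n_bins + 1)
    return np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)

tokens = tokenize(np.array([1.0, 2.0, 3.0, 100.0]))
# Forecasting then becomes next-token prediction over this vocabulary;
# sampled tokens are mapped back to real values by inverting the scaling.
```

Because larger values land in higher bins, the token sequence preserves the ordering and shape of the original series while fitting a fixed vocabulary.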
Moirai from Salesforce uses a patch-based universal encoder that handles multivariate series and irregular timestamps natively. It is the right choice when you have covariates (promotions, weather, economic indicators) or when your data has variable sampling rates like IoT sensor streams.
TimesFM from Google DeepMind uses a decoder-only transformer, which means autoregressive generation with no encoder pass. This makes it the fastest architecture for online inference where latency matters. TimesFM-2.0 extends the context length to 2048 time points and improves accuracy on longer horizons.
Lag-Llama uses the Llama backbone with lag features as input representations. Because the architecture is identical to LLMs you likely already fine-tune, LoRA and PEFT tooling applies directly. If you have 10,000+ proprietary series and need domain adaptation, Lag-Llama is the easiest to fine-tune.
GPU Hardware Requirements
VRAM estimates use the fp16 rule of thumb (2 bytes per parameter) with a 1.5x overhead multiplier for KV cache and activations.
| Model Variant | Parameters | Min VRAM (fp16) | Recommended GPU | Max Batch Size (recommended GPU) |
|---|---|---|---|---|
| Chronos-Bolt-Small | 9M | ~0.1 GB | Any GPU | 512 |
| Chronos-T5-Small | 46M | ~0.5 GB | Any GPU with 4 GB+ | 512 |
| Chronos-T5-Base | 200M | ~1 GB | Any GPU with 8 GB+ | 256 |
| Chronos-T5-Large | 710M | ~2.5 GB | L40S / A100 | 128 |
| Moirai-Small | 14M | ~0.3 GB | Any GPU | 512 |
| Moirai-Base | 91M | ~0.8 GB | Any GPU with 4 GB+ | 256 |
| Moirai-Large | 311M | ~1.5 GB | L40S / A100 | 128 |
| TimesFM-200M | 200M | ~1 GB | Any GPU with 4 GB+ | 256 |
| TimesFM-500M | 500M | ~2 GB | L40S / A100 | 128 |
| Lag-Llama-2.5M | 2.5M | <0.1 GB | Any GPU with 4 GB+ | 512 |
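The fp16 rule of thumb behind the table can be written as a one-line estimator. It gives a lower bound; the table's figures round up further to leave safety margin for fixed framework overheads:

```python
def min_vram_gb(params: float, bytes_per_param: int = 2, overhead: float = 1.5) -> float:
    """fp16 rule of thumb: 2 bytes per parameter with a 1.5x multiplier
    for KV cache and activations."""
    return params * bytes_per_param * overhead / 1e9

# Chronos-T5-Large (710M params): ~2.13 GB by the rule; the table lists ~2.5 GB
print(round(min_vram_gb(710e6), 2))
```

The takeaway: even the largest variants in the table are tiny by LLM standards, so VRAM is rarely the binding constraint — batch size and context length are.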
The L40S 48GB and A100 80GB are the sweet spot for production time series workloads. With 48 GB and 80 GB of VRAM respectively, you can run all Chronos sizes simultaneously (Small + Base + Large totals ~4 GB) from a single instance, which matters for tiered SLA designs where you want to route simple forecasts to a smaller model and complex requests to Chronos-T5-Large. The A100 80GB also gives you headroom for Moirai-Large with large batch sizes and long context windows. See L40S vs A100 for a deeper GPU comparison including bandwidth and compute tradeoffs.
Time series foundation models are also a natural fit for GPU cloud requirements planning: Chronos-T5-Large (710M) tops out at ~2.5 GB VRAM, meaning even the largest Chronos variant sits comfortably in the L40S tier without requiring H100-class hardware.
Production Deployment: BentoML and Triton
BentoML Serving (Recommended for Most Teams)
BentoML wraps your model in a production-grade API with built-in batching, health checks, and Prometheus metrics. For Chronos:
```python
import asyncio
import torch
import bentoml
from chronos import ChronosPipeline
from typing import List


@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 30},
)
class ChronosService:
    def __init__(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = ChronosPipeline.from_pretrained(
            "amazon/chronos-t5-large",
            device_map=device,
            torch_dtype=torch.bfloat16,
        )

    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=5000)
    async def predict(
        self,
        context: List[List[float]],
        prediction_length: int = 24,
        num_samples: int = 100,
    ) -> dict:
        tensors = [torch.tensor(s, dtype=torch.float32) for s in context]
        # Run the blocking predict call off the event loop thread
        forecast = await asyncio.to_thread(
            self.pipeline.predict,
            tensors,
            prediction_length=prediction_length,
            num_samples=num_samples,
        )
        return {
            "p10": forecast.quantile(0.1, dim=1).tolist(),
            "p50": forecast.quantile(0.5, dim=1).tolist(),
            "p90": forecast.quantile(0.9, dim=1).tolist(),
        }
```

Build and serve:
```bash
pip install chronos-forecasting bentoml torch

# Start dev server
bentoml serve service:ChronosService

# Build container for deployment
bentoml build
bentoml containerize chronos_service:latest
```

The service exposes a POST endpoint at `/predict`. BentoML collates incoming requests into batches of up to `max_batch_size=64`, or until `max_latency_ms=5000` elapses, whichever comes first.
NVIDIA Triton Inference Server (For Maximum Throughput)
Triton with dynamic batching is the better choice when you need sustained high-QPS serving and want fine-grained control over batching latency. Export Chronos to ONNX with Optimum, which captures the full encoder-decoder generation loop. A plain `torch.onnx.export` on `model.forward()` traces only a single decoder step: the exported graph returns logits for one token, not a complete forecast. Generating a 24-step forecast requires running the decoder 24 times autoregressively, appending the argmax token to `decoder_input_ids` at each step, and that loop is absent from a traced graph. Use Optimum instead:
```python
# pip install optimum[onnxruntime-gpu]
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Exports encoder, decoder, and decoder_with_past as separate ONNX graphs,
# preserving the full autoregressive generation loop.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "amazon/chronos-t5-large",
    export=True,
)
model.save_pretrained("chronos_large_onnx/")
# Produces:
#   chronos_large_onnx/encoder_model.onnx
#   chronos_large_onnx/decoder_model.onnx
#   chronos_large_onnx/decoder_with_past_model.onnx
```

Seq2seq models require a Triton ensemble that orchestrates the three ONNX graphs (encoder, decoder, decoder_with_past). The `config.pbtxt` for the encoder model:
```protobuf
name: "chronos_encoder"
backend: "onnxruntime"
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 1000
  preferred_batch_size: [32, 64, 128]
}
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [-1] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [-1] }
]
output [
  { name: "last_hidden_state", data_type: TYPE_FP32, dims: [-1, -1] }
]
```

The decoder and autoregressive loop are orchestrated by a Triton BLS (Business Logic Scripting) Python model that calls `encoder_model`, then iterates `decoder_with_past_model` for each forecast step. For teams without Triton BLS experience, BentoML is the simpler path: it calls `pipeline.predict` directly, which handles the full generation loop in Python before returning results.
Use Triton over BentoML when you are sustaining more than 500 requests/second and need dynamic batching overhead below 1ms. BentoML wins for teams that want simpler operations, built-in Python service logic, and faster iteration.
Batched Probabilistic Forecasting
Probabilistic output is what separates foundation models from point forecasters. Instead of a single predicted value at each horizon step, Chronos and Moirai return a distribution: you get P10, P50, and P90 quantiles, which map to "optimistic", "median", and "pessimistic" scenarios for planning purposes.
For Chronos, num_samples controls the number of Monte Carlo draws used to estimate quantiles. Higher values give more accurate quantile estimates at the cost of compute:
```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Batch of 32 series, 168 historical steps each
context = [torch.randn(168) for _ in range(32)]

forecast = pipeline.predict(
    context,
    prediction_length=24,
    num_samples=100,
)

p10 = forecast.quantile(0.1, dim=1)  # (32, 24) - lower bound
p50 = forecast.quantile(0.5, dim=1)  # (32, 24) - median
p90 = forecast.quantile(0.9, dim=1)  # (32, 24) - upper bound

# Compute quantile (pinball) loss on a held-out validation set
def quantile_loss(q, y_true, y_pred_q):
    errors = y_true - y_pred_q
    return torch.mean(torch.max(q * errors, (q - 1) * errors))

val_actuals = torch.randn(32, 24)  # (32, 24) ground truth values
ql_p50 = quantile_loss(0.5, val_actuals, p50)
ql_p90 = quantile_loss(0.9, val_actuals, p90)
```

For Moirai, the output distribution type depends on the data domain. Moirai automatically selects between Normal, NegativeBinomial (for count data like sales), and StudentT (for heavy-tailed distributions).
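Before trusting quantile forecasts for planning, it is worth checking empirical coverage: the fraction of held-out actuals falling inside the P10-P90 band, which should sit near 0.8 for a well-calibrated model. A minimal sketch with NumPy arrays standing in for real quantile forecasts:

```python
import numpy as np

def interval_coverage(actuals, p10, p90) -> float:
    """Fraction of actual values that fall inside the [P10, P90] band."""
    inside = (actuals >= p10) & (actuals <= p90)
    return float(inside.mean())

rng = np.random.default_rng(0)
actuals = rng.normal(size=(32, 24))
# Perfectly calibrated N(0, 1) quantiles for illustration: z_0.1 ≈ -1.2816
cov = interval_coverage(actuals, np.full((32, 24), -1.2816), np.full((32, 24), 1.2816))
# cov should land near 0.8; large deviations indicate miscalibration
```

Coverage well below 0.8 means the intervals are too narrow (overconfident); well above means they are too wide to be useful for inventory or capacity decisions.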
Latency and Throughput Benchmarks
Approximate throughput for Chronos-T5-Large at batch_size=32, prediction_length=24, num_samples=100 on Spheron GPU instances.
| GPU | On-demand price/hr | Spot price/hr | Forecasts/sec (approx.) | Cost per 1M forecasts |
|---|---|---|---|---|
| L40S 48GB | $0.72 | N/A | ~420 | ~$0.48 (on-demand) |
| A100 80GB | $1.04 | N/A (limited availability) | ~380 | ~$0.76 (on-demand) |
| H100 80GB PCIe | $7.91 | N/A | ~650 | ~$3.38 (on-demand) |
Throughput figures are approximate estimates based on parameter-count scaling and published benchmarks. Actual numbers vary with context length, num_samples, and driver versions. Profile with your own series lengths before capacity planning. A100 80GB spot capacity is currently constrained on Spheron; on-demand is the recommended tier for A100 workloads.
Pricing fluctuates with GPU availability. The prices above were checked on 11 May 2026 and may have changed; check current GPU pricing for live rates.
The L40S is the most cost-effective tier for batch time series forecasting. Its 48 GB VRAM fits all Chronos and TimesFM variants with room for batch_size=128+. For mixed workloads where you run both LLM inference and time series forecasting on the same instance, the A100 80GB gives more headroom for concurrency. The H100 delivers highest throughput per GPU but at a price point that only makes sense for latency-sensitive real-time serving.
For context on how these GPU rates compare across providers, see the GPU cloud pricing comparison 2026.
Real-World Use Cases
Retail demand planning. Run Chronos-T5-Large overnight across 50,000 SKUs, forecasting 28-day demand for inventory positioning. At ~420 forecasts/sec on an L40S, the full 50K-SKU batch completes in roughly 2 minutes. AWS Forecast at $0.60 per 1,000 forecasted time series would cost around $30 for the same 50K series. The GPU job costs under $0.03.
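The 50K-SKU figures fall directly out of the throughput and pricing numbers above; a quick sketch of the arithmetic:

```python
SKUS = 50_000
THROUGHPUT = 420          # forecasts/sec, Chronos-T5-Large on L40S
PRICE_PER_HOUR = 0.72     # L40S on-demand, $/hr
AWS_PER_1K = 0.60         # AWS Forecast, $ per 1K forecasted series

runtime_s = SKUS / THROUGHPUT                 # ~119 s, roughly 2 minutes
gpu_cost = runtime_s / 3600 * PRICE_PER_HOUR  # ~$0.024
aws_cost = SKUS / 1000 * AWS_PER_1K           # $30 for the same batch

print(f"{runtime_s:.0f}s runtime, ${gpu_cost:.3f} GPU vs ${aws_cost:.2f} managed")
```

The same template lets you plug in your own SKU count and measured throughput before committing to a nightly schedule.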
Cloud capacity forecasting. Feed CPU and memory utilization metrics into TimesFM to predict utilization spikes 24-48 hours ahead. Autoscaling systems can pre-warm instances before traffic arrives instead of reacting after the spike. TimesFM's decoder-only architecture makes it fast enough for sub-minute refresh cycles on a single GPU.
IoT sensor anomaly detection. Moirai handles irregularly sampled sensor telemetry natively through its patch encoder. Run Moirai-Large over a fleet of industrial sensors, flag series where the P90 forecast deviates from observed values by more than 3 sigma, and route anomalous series to a human review queue. Moirai's multivariate support lets you include temperature, pressure, and vibration readings as covariates for better anomaly context.
Ad bidding and revenue forecasting. Generate 15-minute-resolution CTR and revenue forecasts for budget pacing in programmatic advertising. Low latency requirements (sub-second per series) suit TimesFM's decoder-only architecture. At 15-minute granularity with a 4-hour prediction horizon, you get 16 steps per series, and TimesFM processes batches of 1000 series in under a second on an L40S.
Fine-Tuning vs Zero-Shot: When Each Wins
| Scenario | Recommendation |
|---|---|
| Standard retail seasonality, 100+ history points | Zero-shot (Chronos or TimesFM) |
| Irregular sampling rates, multivariate covariates | Zero-shot (Moirai) |
| Proprietary domain, 10K+ training series available | Fine-tune Lag-Llama with LoRA |
| Very short series (fewer than 50 points) | Classical (ARIMA, ETS) - foundation models need context |
| Real-time streaming with sub-second latency | TimesFM zero-shot (fastest architecture) |
| Latency-tolerant batch, want smallest possible model | Chronos-Bolt (up to 250x faster than T5, 20x more memory-efficient, similar accuracy) |
For fine-tuning Lag-Llama with LoRA: the Llama backbone means standard PEFT tooling applies directly. You can fine-tune on a single L40S for domain adaptation, then serve the base model plus adapter weights from a single instance. See the LoRA multi-adapter serving guide for serving multiple fine-tuned adapters on one GPU.
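Why LoRA fine-tuning is cheap enough for a single L40S: instead of updating a full weight matrix W, you train two small matrices whose product is the update. A plain-NumPy illustration of the parameter savings (the dimensions are illustrative, and this is the underlying math, not the Lag-Llama training loop, which standard PEFT tooling handles):

```python
import numpy as np

d, r = 512, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

W_eff = W + B @ A                    # effective weight during fine-tuning

full_params = W.size                 # 262,144 if trained directly
lora_params = A.size + B.size        # 8,192 -> ~3% of the full matrix
```

Because B starts at zero, `W_eff` equals the pretrained weight at step zero, so fine-tuning begins exactly from the zero-shot model and only the small A and B matrices accumulate gradients.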
Cost Comparison: GPU Cloud vs Managed Forecasting Services
| Service | Pricing model | Cost per 1K time series | Notes |
|---|---|---|---|
| AWS Forecast | Per forecasted time series | $0.60 | Minimum 1K time series, no spot pricing |
| GCP Vertex AI Forecasting | Per 1,000 rows at prediction | ~$0.60 | AutoML pricing |
| SageMaker Canvas (time series) | Session + inference | Variable | Session-based billing adds overhead |
| Chronos on Spheron L40S | GPU cost / forecasts | ~$0.0005 | At scale batch jobs (~420 forecasts/sec, $0.72/hr) |
Managed services win for low-volume, occasional forecasting jobs where you do not want to manage infrastructure. Below roughly 100K forecasted time steps per month, the setup cost and operational overhead of self-hosting outweighs the per-step savings.
Self-hosting wins decisively above ~1 million forecasts per month. At that scale, the cost difference is over three orders of magnitude. Chronos-T5-Large on a single L40S processes millions of forecasted steps per hour at a fraction of managed API cost.
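The break-even point can be sketched from the table's per-1K prices plus whatever you assign as fixed monthly self-hosting overhead. The overhead figure below is a placeholder assumption for engineer time and monitoring, not a benchmark:

```python
MANAGED_PER_1K = 0.60       # AWS Forecast, $ per 1K series
SELF_HOSTED_PER_1K = 0.0005 # Chronos on L40S, $ per 1K series
MONTHLY_OVERHEAD = 500.0    # placeholder: ops time, monitoring, etc.

def monthly_cost(series_per_month: float, per_1k: float, fixed: float = 0.0) -> float:
    """Total monthly spend: per-unit cost plus any fixed overhead."""
    return series_per_month / 1000 * per_1k + fixed

for volume in (100_000, 1_000_000, 10_000_000):
    managed = monthly_cost(volume, MANAGED_PER_1K)
    selfhost = monthly_cost(volume, SELF_HOSTED_PER_1K, MONTHLY_OVERHEAD)
    print(f"{volume:>10,}: managed ${managed:,.0f} vs self-hosted ${selfhost:,.2f}")
```

Under these assumptions, managed wins at 100K series/month ($60 vs ~$500) and self-hosting wins from roughly 1M series/month onward, matching the thresholds above.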
For the broader make-vs-buy analysis framework, see LLM inference on-premise vs GPU cloud.
Deploy Time Series Models on Spheron: Step-by-Step
- Log in to app.spheron.ai. Navigate to GPU Marketplace and rent an L40S on Spheron for batch jobs or an A100 on Spheron for larger models or mixed workloads. Choose spot pricing (when available) for overnight batch runs and on-demand for sustained serving.
- SSH into the instance once provisioned. Verify CUDA with `nvidia-smi`; you should see the L40S or A100 listed with full VRAM available. For SSH key setup and instance configuration, see docs.spheron.ai.
- Install dependencies:

```bash
pip install chronos-forecasting bentoml torch torchvision torchaudio
# or for Moirai
pip install uni2ts bentoml torch
# or for TimesFM
pip install "timesfm[torch]" bentoml
```

- Sanity check the model on synthetic data:
```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

context = torch.randn(1, 168)
forecast = pipeline.predict(context, prediction_length=24, num_samples=100)
print(f"P50 forecast shape: {forecast.quantile(0.5, dim=1).shape}")  # should be (1, 24)
```

- For batch production serving, run the BentoML service from the earlier section:
```bash
bentoml serve service:ChronosService --port 8080 --workers 1
```

- Keep the service alive across sessions using `screen` or `tmux`:

```bash
screen -S forecaster
bentoml serve service:ChronosService --port 8080
# Detach with Ctrl+A, D
```

- For overnight batch jobs on spot instances, checkpoint progress to disk after every N series so an interrupted job can resume. Write completed series indices to a file and skip them on restart. Use an atomic write (temp file + `os.replace`) so a preemption between `open()` and `json.dump()` does not leave an empty or partial checkpoint that crashes the next run:
```python
import json, os

CHECKPOINT = "completed_series.json"

if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        completed = set(json.load(f))
else:
    completed = set()

for idx, series in enumerate(all_series):
    if idx in completed:
        continue
    result = pipeline.predict(series, prediction_length=24)
    save_result(idx, result)
    completed.add(idx)

    # Atomic checkpoint: write to a temp file, then swap into place
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(list(completed), f)
    os.replace(tmp, CHECKPOINT)  # atomic swap, safe on preemption
```

Pitfalls and Known Limitations
- Covariate handling: Chronos and Lag-Llama are univariate-only. Each series is predicted independently with no external regressors. Moirai supports covariates via its patch encoder. If promotions, holidays, or external signals are important for your forecast accuracy, use Moirai or pre-encode the covariates into the time series context before passing it to Chronos.
- Irregular timestamps: Standard Chronos assumes regular frequency. Pass irregularly sampled data to Chronos and it will misinterpret the gaps as part of the signal. Use Moirai for series with gaps or variable sampling rates.
- Long-horizon degradation: All four models degrade beyond their trained maximum prediction length. Chronos-T5 at `prediction_length > 64` shows increasing uncertainty spread. TimesFM is more stable out to 512 steps. Validate on your held-out horizon before committing to production.
- Cold-start on very short series: Models need at least 50-100 historical observations for reliable zero-shot output. Below that threshold, statistical methods (Holt-Winters, ARIMA) typically outperform foundation models. Do not use zero-shot foundation models on series with fewer than 50 data points without validating on held-out data.
- Memory pressure with large batches and long context: Moirai with `context_length=5000` and `batch_size=64` will OOM on a 48 GB card. In practice, context above 1,000 steps with `batch_size > 16` is risky. Profile memory with `torch.cuda.memory_summary(device="cuda", abbreviated=True)` before setting production batch sizes, and reduce `batch_size` if you see memory pressure warnings.
- Chronos-Bolt vs Chronos-T5: Chronos-Bolt (`amazon/chronos-bolt-small`, `amazon/chronos-bolt-base`, etc.) is the distilled family. It is up to 250x faster and 20x more memory-efficient than Chronos-T5, at a modest accuracy tradeoff. For most production batch jobs, Chronos-Bolt-Base is the better default: faster, cheaper, and accurate enough for standard seasonality patterns.
- Moirai commercial license restriction: Moirai is licensed under CC BY-NC 4.0, which prohibits commercial use without a commercial license from Salesforce. Do not deploy Moirai in revenue-generating production pipelines without confirming your licensing compliance. Chronos, TimesFM, and Lag-Llama are Apache 2.0 and commercially unrestricted.
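Two of the pitfalls above — irregular timestamps and very short series — can be guarded against with a small preprocessing gate before anything reaches the model. A sketch using pandas; the one-hour frequency and 50-point threshold are illustrative choices, not fixed requirements:

```python
import pandas as pd

MIN_HISTORY = 50  # below this, fall back to a classical method

def prepare_for_chronos(ts: pd.Series):
    """Regularize an irregularly sampled series; return None if too short."""
    # Resample to a fixed frequency so gaps are not misread as signal,
    # then fill the holes by time-weighted interpolation.
    regular = ts.resample("1h").mean().interpolate(method="time")
    if len(regular) < MIN_HISTORY:
        return None  # route to Holt-Winters / ARIMA instead
    return regular

# Irregularly spaced observations over ~2 days
idx = pd.to_datetime(["2026-01-01 00:00", "2026-01-01 03:30",
                      "2026-01-02 12:00", "2026-01-03 01:15"])
raw = pd.Series([1.0, 2.0, 5.0, 3.0], index=idx)
clean = prepare_for_chronos(raw)
```

Series that come back `None` go to the classical fallback path; everything else is now on a regular grid and safe to batch into Chronos.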
Time series foundation models shift batch forecasting from a managed-service line item to a self-hosted GPU workload. For overnight demand planning runs across thousands of SKUs, a single L40S on-demand instance on Spheron delivers cost-per-million-forecasts well below AWS Forecast. Rent an A100 80GB for larger models or mixed workloads.
